OPTIMIZATION AND STANDARDIZATION


OF  INFORMATION  RETRIEVAL  LANGUAGE  AND  SYSTEMS 


SUMMARY 


The studies described in this report have been aimed primarily at
analyzing the organization of data files in the document retrieval
application; these studies are contained in Part I. As a byproduct of a
number of analyses conducted on a sample of 38,402 DDC (formerly ASTIA)
documents, many term association statistics have been developed. These are
presented in Part II, together with a discussion of the implications of
association data on file design and use.

A.  ORGANIZATION  OF  DOCUMENT  RETRIEVAL  INDEX  FILES 

One proposed type of index file organization is the Multi-List System,
a variation of the conventional list-organized file in which the chains or
lists are based upon groups of two or three index terms rather than just one.
The implications and effects of this proposal have been investigated by a
series of computer programs simulating the establishment and maintenance of
the files, using as data base the 600 most common index terms in the DDC
sample. The results indicate that a large amount of processing, against an
extensive data base, is necessary to accomplish the grouping and that the
desired objective is not met--most documents have almost as many groups as
index terms and the postulated reduction in lists traversing a given
document cannot be realized. It is concluded that the Multi-List System does
not offer an efficient approach to the organization of a document retrieval
file.


One  proposed  variation  of  the  document-sequenced  file  orders  it  on  the 
lowest  index  term  code  included  in  each  document  description,  rather  than  in 
straight  accession  number  order.  The  intent  is  to  reduce  the  portion  of  the 
file  searched  by  eliminating  documents  which  cannot  have  term  codes  included 
in a request. Although this approach is somewhat preferable to the
conventional document-sequenced file, evaluation indicates that the
reduction is not enough to make it an efficient method.

Finally, the list-organized file technique is analyzed and compared
with the inverted and document-sequenced files. System requirements which
can be met with an inverted file are described, together with those which
require access to a document record. Analysis shows that the list-organized
file is an amalgamation of the inverted and document-sequenced files. It is
concluded that maintenance and use of the two separate files is more efficient
than the list-organized technique when requirements cannot be met by the
inverted file alone. A technique for the optimum detail organization of the
two files, by which both actual computing and over-all elapsed processing
times can be minimized, is described.

B. TERM ASSOCIATIONS IN DOCUMENT RETRIEVAL


The documents in the DDC sample generate a large number of different
pair associations of index terms, most of which occur only one or two times.
In general, individual terms form many different pairs and the number increases
with total frequency of usage. Terms within one DDC thesaurus group have a
high probability of forming pairs and these tend to occur frequently. A
lesser tendency, still pronounced, is observed for terms within one field of
interest. However, 85% of all pairs involve terms in two different fields.
There is no pronounced evidence that index term usage can be predicated upon,
or is highly correlated to, the structural hierarchy of the thesaurus. A
number of tables summarizing pair association data in the sample are included.

The significance of the high percentage of pairs which occur only a few
times is questioned: it is not clear whether such occurrences can be
interpreted statistically as representing more than random associations. Some
implications of using associations involving terms of broad scope or wide
applicability are discussed. It is considered that relationships implicit in
the hierarchical structure of a thesaurus have potential application, both in
processing search requests and in aiding the description of documents by such
techniques as "lowest level indexing."

Analysis of the DDC data indicates that the use of only a few hundred
documents as data base for term association studies generates relationships
not representative of the library as a whole. Conclusions derived from these
small samples can be highly misleading, particularly if the documents are
limited to one subject area. It is believed that meaningful studies require
a data base of at least several thousand documents.

The use of term associations is considered to have definite potential
in document retrieval. However, the determination of significant associations,
the use of thesaurus-implicit relationships in both indexing and searching,
and the processing techniques and requirements for incorporating term
associations into an operative system, all are deemed to be areas for further
investigation.


I.  ANALYSES  INTO  METHODS  OF  INDEX  TERM  FILE  ORGANIZATION 


In an IS&R application, documents are described by a variable number of
index terms. Usually, the describing terms are taken from a controlled
thesaurus of allowable terms, with their definitions, although sometimes an
uncontrolled thesaurus--equivalent to free-language indexing--is used. In
either case, the document numbers and associated index terms must be set up
in a file which is the data bank against which search requests are processed.

There  are  four  basic  ways  in  which  this  document  number  index  file  can 
be  organized: 

a.  Document  Number  Sequence,  in  which  the  document  number  is  the 
record identifier and the associated index terms comprise the body
of  the  record.  Every  record  in  the  file  must  be  processed  against 
the  logical  relationships  of  index  terms  in  a  search  request. 
Although  the  file  is  usually  set  up  in  document  number  sequence, 
other  orders  are  permissible  and  search  requests  can  be  processed 
against  a  completely  random  file. 

b.  Inverted  Sequence,  in  which  the  index  term  is  the  record  identifier 
and  the  document  numbers  in  which  it  appears  comprise  the  body  of 
the record. Processing of a search request requires accessing only
the index terms it contains, the pertinent document numbers being
selected on the basis of the logical relationships connecting the
terms  of  the  request.  Inverted-sequence  files  usually  are  set  up 
in  sequence  on  index  term  identifying  numbers. 


c. Document Number Sequence, List Organized, in which each index term
associated with a document is "chained" to another document
described by that term. The "chain address" can be either a document
number or its location in the file. A separate entry table contains
the document number or address (file location) of the first
document using each index term. The chain addresses permit traversing
all documents containing an index term, each single document
specifying another in the "chain." Such a file is said to be "list-
organized," each index term comprising a "list" which is entered
via the entry table and traced through, document by document, using
the chain addresses. A document belongs to as many "lists" as it
has index terms. In processing a search request whose index terms
are connected by logical "and" relationships, one term is selected
and only the documents in its "list" are examined to determine if the
terms describing each one meet the criteria of the request. If the
file is maintained on a random access (mass storage) device, it need
not be in document number sequence: the "chain addresses" can jump
back and forth through the file. If stored on magnetic tape
or other sequential access devices, the file is set up in document
number sequence and the "chain addresses" jump forward, not backward.

d. Superimposed Coding, in which index terms are denoted by randomly
selected codes in a fixed-length field, usually binary, and the
document description is created by logical superimposition of the
codes for the index terms it contains. Each code may be, for
example, five random 1-bits in an 80-bit field; the final
superimposed code contains a 1-bit in every position in which any one or
more of the constituent index term codes has a 1-bit. This type of
code may replace, or be generated in addition to, the detail index
terms involved; the record key is the document number. Different
combinations of index terms can generate the same superimposed code
and, as a result, retrievals may include some nonpertinent documents.
The percentage of this "noise" can be kept below any desired level
by appropriate selection of field size and number of 1-bits in each
code. In this type of file organization, every record must be
examined in processing a search request. In most cases, however,
a document can be accepted or rejected with many fewer comparisons
than are required for the conventional document-sequenced file.
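
The superimposed coding described in item d can be illustrated with a short
sketch. It is not taken from the report: the function names are hypothetical,
and seeding a pseudo-random generator with the term is only one assumed way of
obtaining a repeatable "randomly selected" code. The field width of 80 bits
and five 1-bits per term follow the example in the text.

```python
import random

FIELD_BITS = 80     # width of the code field (example value from the text)
BITS_PER_TERM = 5   # number of 1-bits per index term (example value from the text)

def term_code(term: str) -> int:
    """Return a repeatable pseudo-random code with BITS_PER_TERM 1-bits."""
    rng = random.Random(term)  # assumption: seed by term so the code is stable
    code = 0
    for position in rng.sample(range(FIELD_BITS), BITS_PER_TERM):
        code |= 1 << position
    return code

def document_code(terms) -> int:
    """Superimpose (logical OR) the codes of all index terms of a document."""
    code = 0
    for term in terms:
        code |= term_code(term)
    return code

def may_match(doc_code: int, request_terms) -> bool:
    """True if every 1-bit required by the request is set in the document code.

    A True result can be "noise": other term combinations may set the same bits.
    """
    mask = document_code(request_terms)
    return doc_code & mask == mask
```

A document coded from three terms always passes the test for any subset of
those terms; a document lacking a requested term usually fails it, but not
always, which is the "noise" the text refers to.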


A.  COMMENTS  ON  METHODS  OF  FILE  ORGANIZATION 


The second and third types of file organization are those which have
been studied most intensively in applying electronic data-processing
equipment to IS&R applications. In actual operative systems, the inverted file
probably is the most common form of file organization, although some magnetic
tape applications use a document-number sequence file. The list-organized
file appears to be considered suitable primarily when a mass-storage, random-
access device is postulated. It is, however, completely feasible when
magnetic tape is used. Superimposed coding has had the least consideration
and, in operative systems, appears to be restricted to manual operations
with files maintained on edge-notched or punch cards, or similar storage
media.

There  seems  to  be  general  agreement  that,  of  the  first  three  types,  the 
document-sequenced  file  is  markedly  inferior  to  either  of  the  other  two. 

The  necessity  for  inspecting  every  document  record  in  processing  a  search 
request  entails  a  comparison  work  load  (matching  index  terms  against  those 
of  the  search  request)  two  or  more  orders  of  magnitude  greater  than  with 
either  an  inverted  or  list-organized  file.  This  factor  normally  makes  it 
unattractive  for  batch  processing  of  search  requests  using  any  type  of 
current  equipment  with  multiprocessing  capabilities.  The  other  two  use 
much  less  internal  processing  time,  even  though  the  list-organized  file 
essentially  doubles  the  amount  of  data  to  be  transferred  into  the  computer 
memory.  In  practice,  a  search  through  a  document-sequenced  file  almost 
always  is  a  badly  tape-limited  computer  operation;  the  index  terms  in  each 
of  the  several  (one  or  more)  requests  must  be  matched  against  those  of  each 
file  record  until  rejection  or  acceptance  occurs.  Even  though  rejection 
(the  common  disposition)  frequently  occurs  fairly  early  in  this  matching 
process,  the  total  comparison  time  normally  is  several  times  longer  than 
the  actual  tape-to-memory  transfer  time  of  a  record.  For  this  reason, 
there is no particular advantage in storing a document-sequenced file on a
mass-storage device rather than on magnetic tape; internal computing, not
data transfers, governs the total processing time. In a real-time IS&R
application, this type of file organization obviously is inapplicable.

The  inverted  sequence  file  has  the  advantage  of  a  small  number  of 
records--one  for  each  index  term  in  the  thesaurus.  Typically,  this  is  on 
the  order  of  a  few  thousand,  whereas  documents  are  numbered  in  the  tens  of 
thousands.  The  records  are  highly  variable  in  length;  some  index  terms 
appear  in  only  a  single  document  description  while  others  are  used  in 
thousands  of  them.  This  type  of  file  has  two  basic  implications  in  the 
processing  of  search  requests. 

First,  the  only  records  examined  are  those  for  the  index  terms  in  the 
search  request  (or  requests  in  batch  processing)  and  the  output  is  limited 
to  a  list  of  document  numbers  satisfying  the  request  logic.  Other  index 
terms  used  to  describe  the  selected  documents  are  not  included  and  cannot 
easily  be  obtained  from  an  inverted  file.  Their  omission  removes  one 
possible  means  for  quickly  determining  document  pertinency.  Second,  all 
document  numbers  for  an  index  term  must  be  matched  against  those  carried 
forward  to  the  current  stage  of  processing.  Thus  an  entire  record  of 
several hundred document numbers may have to be scanned to find the "matches"
against  a  relative  handful  which  so  far  have  satisfied  the  request  criteria; 
this  may  be  repeated  for  several  more  index  terms. 
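
The two implications above can be seen in a minimal sketch of "and"
processing against an inverted file. The sketch is an illustration only: the
dictionary layout and the function names are assumptions, not the report's
implementation. Each record is taken to be an ascending list of document
numbers, and the numbers carried forward are matched against the full record
of each further term.

```python
def intersect_sorted(carried, record):
    """Merge-scan two ascending lists of document numbers, keeping the matches."""
    matches, i, j = [], 0, 0
    while i < len(carried) and j < len(record):
        if carried[i] == record[j]:
            matches.append(carried[i])
            i += 1
            j += 1
        elif carried[i] < record[j]:
            i += 1
        else:
            j += 1
    return matches

def search_inverted(inverted_file, request_terms):
    """Return document numbers described by every term in request_terms.

    inverted_file: dict mapping index term -> ascending list of document numbers.
    Only the records for the requested terms are accessed.
    """
    carried = None
    for term in request_terms:
        record = inverted_file.get(term, [])
        if carried is None:
            carried = list(record)          # first term: carry its whole record
        else:
            carried = intersect_sorted(carried, record)
    return carried or []
```

The result is a bare list of document numbers; the other index terms of the
selected documents are not available from this file, as the text notes.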

A list-organized file combines the selective-search advantages of the
inverted file with the advantage inherent in the document-sequenced
organization of obtaining all index terms associated with selected documents.
Its disadvantage is that, for practical purposes, file size is doubled when
compared with the other two. Like the document-sequenced file, a document
record, once accessed, is accepted or rejected on the spot; there is no
carry-over to a subsequent stage of processing. The number of records
inspected can be minimized if the entry table to the list for each index term
includes its number of occurrences in the file. Assuming logical "and"
relationships between terms in a search request, it is easy to determine the
one with the fewest occurrences and to examine only the documents in that
list. More complex term relationships in a request may require entry to and
processing of more than one such list, but each can be the shortest one
applicable to a subset of the request terms.
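
A minimal sketch of this procedure follows, assuming an in-memory layout in
which each document record maps its index terms to a chain address (the next
document number using that term) and the entry table holds the first document
and the occurrence count for each term. The names and the layout are
hypothetical and serve only to illustrate the technique.

```python
def build_list_file(documents):
    """Build a list-organized file from {doc_no: iterable of index terms}.

    Returns (records, entry_table) where
      records[doc_no][term] = next doc_no in that term's list, or None
      entry_table[term]     = [first doc_no, number of occurrences]
    """
    records, entry_table, last_seen = {}, {}, {}
    for doc_no in sorted(documents):        # chains jump forward only
        records[doc_no] = {term: None for term in documents[doc_no]}
        for term in records[doc_no]:
            if term in last_seen:
                records[last_seen[term]][term] = doc_no
                entry_table[term][1] += 1
            else:
                entry_table[term] = [doc_no, 1]
            last_seen[term] = doc_no
    return records, entry_table

def search_list_file(records, entry_table, request_terms):
    """Process an all-"and" request by tracing only the shortest list."""
    if not request_terms or any(t not in entry_table for t in request_terms):
        return []
    shortest = min(request_terms, key=lambda t: entry_table[t][1])
    hits, doc_no = [], entry_table[shortest][0]
    while doc_no is not None:
        record = records[doc_no]
        if all(term in record for term in request_terms):
            hits.append(doc_no)             # accepted on the spot
        doc_no = record[shortest]           # follow the chain address
    return hits
```

Each accessed record carries all of its index terms, so acceptance or
rejection needs no carry-over between stages, at the cost of storing a chain
address alongside every term.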

The number of records to be accessed in searching a list-organized file
can be no less than the number of occurrences of the least frequently used
index term in the request. This is highly variable. Some requests may
contain a term used in only two or three documents; in others, the least
frequent term may have 50, 100 or even more occurrences, and many records must
be examined. Complexity of the search request also can affect the number of
record accesses required. Ten terms all connected by logical "ands" can be
processed by entering a single list. If a few "or" relationships are present
and no common "and" term exists, then two or three lists may be entered.
Finally, the minimum number of records accessed is almost directly
proportional to the size of the library (file) being searched. The thesaurus of
index terms, once established, tends to change rather slowly as documents
are added to the file. The frequency of usage of index terms increases, on
the average, directly as the number of documents--more usages are recorded
to a relatively constant number of index terms. Thus as the library increases
in size, more and more records must be accessed in processing a search request.
This characteristic is true even when indexing standards remain unchanged.
Major thesaurus revisions or different indexing criteria also affect the
contents of the file and the processing of requests against it.

With  an  inverted  file,  on  the  other  hand,  one  record  is  accessed  for 
each  term  in  a  search  request.  Although  the  number  of  terms  varies,  the 
maximum  typically  does  not  exceed  the  minimum  by  more  than  about  10:1,  and 
a  fairly  high  percentage  of  requests  have  close  to  the  average  number  of 
terms.  In  general,  the  number  of  records  accessed  is  much  smaller  than  with 
a list-organized file. However, the individual records are longer. Other
factors  remaining  unchanged,  the  number  of  records  to  be  accessed  does  not 
change  as  the  size  of  the  library  increases,  but  the  average  record  length 
does  grow  at  a  rate  proportional  to  that  of  the  number  of  documents. 

With an inverted file, the amount of data transferred into the computer
memory to process a search request is relatively more predictable than with
a list-organized file and is not subject to so much variation. Assuming each
data element to be one word, it is given by

Words Transferred = $T_i (N_i + 1)$,

where

$T_i$ = Number of index terms in the request, and

$N_i$ = Average number of documents in which each term appears.

[The "1" in $(N_i + 1)$ assumes that the index term is one word of the record.]

Although individual N's vary widely in value--from one to several thousand--
for individual terms, the total, and the average, for typical ranges of search
requests are subject to much less variation: the maximum may be on the order
of 2-3 times the minimum.

With a list-organized file, the number of words transferred is given by

Words Transferred = $N_m (2D_i + 1)$,

where

$N_m$ = Number of documents in which the least frequently used index term
appears, and

$D_i$ = Average number of index terms per document.

(In this expression, "$2D_i$" appears because each index term has attached to
it the chain address of the next document in the list; this also is assumed
to require one word for its representation.) Unless $N_m$ is very small, $D_i$
will closely approximate the average number of terms per document in the
entire library and thus is readily predictable. However, as has been
observed, $N_m$ is highly variable. An examination of about 200 search requests
has not revealed any conclusive relationship between the number of terms in
a search request and the overall frequency of usage of any one of them. In
general, it appears that a greater number of terms in a request increases
the probability of finding one used fairly infrequently in the library. At
the same time, requests with many terms tend to have more complex logical
relationships and this increases the probability that several index term
lists, and not only one, will have to be scanned in processing a request.

Without comparative analysis, it is not possible to determine which of
the two types of file organization requires the lesser amount of data
transfers in processing a search request. The list-organized file almost
certainly does if $N_m$ is not over 2-3 times as large as $T_i$, but the exact
break-even point is not known.
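
The two transfer estimates can be evaluated side by side with a small
sketch. The functions restate the expressions T(N + 1) for the inverted file
and N_m(2D + 1) for the list-organized file; the function and parameter names
are illustrative, and any numbers supplied to them are the user's own
assumptions, not figures from the DDC sample.

```python
def words_inverted(terms_in_request: int, avg_docs_per_term: float) -> float:
    """Words transferred for an inverted file: T * (N + 1).

    Each of the T records holds N document numbers plus one word for the term.
    """
    return terms_in_request * (avg_docs_per_term + 1)

def words_list_organized(docs_in_shortest_list: int,
                         avg_terms_per_doc: float) -> float:
    """Words transferred for a list-organized file: N_m * (2D + 1).

    Each of the N_m records holds D terms, D chain addresses and one word
    for the document number.
    """
    return docs_in_shortest_list * (2 * avg_terms_per_doc + 1)

def smaller_transfer(terms_in_request, avg_docs_per_term,
                     docs_in_shortest_list, avg_terms_per_doc) -> str:
    """Name the organization with the smaller estimated transfer."""
    inverted = words_inverted(terms_in_request, avg_docs_per_term)
    listed = words_list_organized(docs_in_shortest_list, avg_terms_per_doc)
    if inverted == listed:
        return "equal"
    return "inverted" if inverted < listed else "list-organized"
```

Such a comparison covers only the volume of data transferred; it says nothing
about the number of separate record accesses.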

It  appears  certain,  however,  that  the  number  of  different  records  to  be 
accessed  is  considerably  greater  with  the  list-organized  file.  This  factor 
can  become  highly  important  when  the  file  is  maintained  on  a  mass-storage, 
random-access  device,  particularly  if  the  application  is  real  time.  In  this 
case,  access  to  records  can  be  made  on  a  random  basis  with  both  list-organized 
and  inverted  files.  Random  access  time  typically  is  much  longer  than  data 
transfer  time,  even  for  very  long  records.  Consequently,  the  total  elapsed 
time  to  process  a  search  request  almost  always  will  be  greater  with  a  list- 
organized  than  with  an  inverted  file  (even  though  the  actual  central  processor 
time  may  be  less).  The  break-even  point  can  be  taken,  with  sufficient 
accuracy  for  practical  purposes,  as  the  case  in  which  the  number  of  index 
terms  in  the  request  equals  the  minimum  number  of  documents  which  must  be 
examined.  Usually  the  latter  is  considerably  greater. 

Some proposals have been made to modify the technique of setting up a
list-organized file to permit more efficient retrieval. One of these has
been examined in detail, with negative results, using as data base the large
sample of 38,402 DDC (formerly ASTIA) documents described in [1].


B. ANALYSIS OF THE MULTI-LIST SYSTEM

The  Multi-List  System  [2],  [3]  is  a  list-organized  file  in  which  each 
list consists of a set of index terms--three being suggested--rather than
having a separate one for each term. Several potential advantages have been
cited for this type of file organization: (1) Although the number of lists
traversing  the  file  is  increased,  their  average  length  is  reduced  and  varia¬ 
tions  in  length  are  much  less  extreme  than  in  the  usual  list-organized  file; 
(2)  a  document  belongs  to  fewer  lists  and,  because  fewer  chain  addresses  are 
needed,  file  storage  requirements  are  less;  (3)  file  searching  is  faster, 
because  fewer  lists  must  be  examined;  and  (4)  the  method  of  organizing  the 
entry  table  to  the  lists  may  permit  eliminating  some  search  requests  (no 
pertinent  documents  in  the  library),  without  examining  any  list,  by  utilizing 
knowledge  that  two  index  terms  have  never  been  used  together  in  a  document 
description. 

1. Mutually Exclusive Attribute Groups and Formation of Lists.


The  method  of  combining  three  index  terms  into  one  list  is  based  upon 
assigning  each  term  to  one  of  a  limited  number  of  attribute  groups.  In  any 
one  attribute  group,  all  its  index  terms  are  mutually  exclusive;  that  is, 
no  two  terms  in  the  group  are  used  together  in  a  document  description.  The 
index  terms,  then,  are  said  to  be  assigned  to  mutually  exclusive  attribute 
groups.  This  array  is  best  illustrated  by  an  example. 

Suppose  a  file  consists  of  records  each  having  nine  keys  or  attributes, 
each  attribute  in  turn  having  ten  mutually  exclusive  possible  values.  An 
attribute,  for  example,  could  be  military  rank;  each  record  (man)  can  have 
only one of the possible values "private," "corporal," and so on up to
"general." There are 90 different possible values (or index terms) in the
file.  These  can  be  denoted  in  the  form  "0608"  for  the  8th  value  in  the  6th 
attribute  column,  etc.  The  mutually  exclusive  attribute  groups  then  look 
like  this: 


   1     2     3     4     5     6     7     8     9

 0101  0201  0301  0401  0501  0601  0701  0801  0901
 0102  0202  0302  0402  0502  0602  0702  0802  0902
 0103  0203  0303  0403  0503  0603  0703  0803  0903
 0104  0204  0304  0404  0504  0604  0704  0804  0904
 0105  0205  0305  0405  0505  0605  0705  0805  0905
 0106  0206  0306  0406  0506  0606  0706  0806  0906
 0107  0207  0307  0407  0507  0607  0707  0807  0907
 0108  0208  0308  0408  0508  0608  0708  0808  0908
 0109  0209  0309  0409  0509  0609  0709  0809  0909
 0110  0210  0310  0410  0510  0610  0710  0810  0910

If each attribute value is placed in a separate list, there are 90 lists
in the file. Some (such as "private" or "ages 20-24") are extremely long,
while others (e.g., "general") are short. Also, each record in the file
belongs to nine lists and has nine tags.

Now let groups of three columns be combined into a single superfield in
which a superkey might consist of one attribute value from each column, as
0104-0202-0307. Each superfield has a possible 1,000 (10 x 10 x 10) of these
superkeys, not all of which are present in the file (generals, ages 20-24,
earning $40-$49 weekly, probably are nonexistent). If a superkey corresponds
to a list, there are at most 3,000 in the file (three superfields of 1,000
superkeys each). Although there now are many more lists traversing the file,
the extremely long ones previously existing are broken up into many smaller
ones by the grouping of three attribute values into one superkey. The
short lists, of course, are even shorter. Each record now has only three,
rather than nine, chain addresses or tags.

Alternatively, a superkey may be created by grouping two or more
attribute values from each of the three columns, the superkey now
representing a range of values rather than a unique combination. For example,
0401 or 0402 may be combined with either 0501 or 0502 and also with either
0603 or 0604, eight different combinations in one superkey. Each column has
five of these pairs of values and a group of three columns has 125
superkeys (5 x 5 x 5). The array as a whole has 375 of them, defining 375
lists in the file. With proper ordering of attribute values within a column,
the very short single lists can be eliminated by combining them with longer
ones and all lists made approximately the same length--possibly a 2 or 3 to 1
maximum variation.
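
The superkey counts quoted in the two preceding paragraphs follow from simple
arithmetic, which the sketch below reproduces. The function and parameter
names are illustrative only.

```python
def superkey_counts(columns: int, values_per_column: int,
                    columns_per_superfield: int, values_per_set: int):
    """Return (superkeys per superfield, maximum superkeys in the whole array).

    values_per_set is the number of attribute values of one column that are
    grouped together inside a superkey (1 for a unique combination).
    """
    sets_per_column = values_per_column // values_per_set
    superfields = columns // columns_per_superfield
    per_superfield = sets_per_column ** columns_per_superfield
    return per_superfield, superfields * per_superfield

# Nine columns of ten values, three columns per superfield:
unique = superkey_counts(9, 10, 3, 1)   # one value per column in each superkey
paired = superkey_counts(9, 10, 3, 2)   # pairs of values from each column
```

With one value per column the function gives 1,000 superkeys per superfield
and 3,000 in all; with pairs of values it gives 125 and 375, the figures
stated in the text.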

This mutually exclusive attribute group array serves as the entry table
to the lists traversing the data file. Each superkey in the array has
attached to it the storage location (or other identification) of the first
record in the list. A desired superkey in the array can be isolated by
standard searching techniques which successively narrow the portion of the
entry table in which it lies.


2. Application of Multi-List System to IS&R (Document Retrieval).

In many types of files, some (or all) of the data elements are values
of attributes, such as age, salary, years of education, etc. Here the
existence of a given entry for an attribute precludes, by definition, any
other value for one file record; a person cannot have two different ages at
the same time. Thus the entries in the attribute group are mutually
exclusive.


The index terms in a document file do not have this type of mutual
exclusiveness. Although many--perhaps most--pairs may define concepts which
are extremely unlikely to co-occur in a document, it is perfectly possible
that, given a library of large enough size, any two terms chosen at random
will be used in the same document description. Their mutual exclusiveness
is strictly a function of usage and two terms which are exclusive today--i.e.,
have never been used together in one document--may not be tomorrow. Use of
the Multi-List System in an IS&R application requires, then, not only an
algorithm to set up the array of mutually exclusive attribute groups initially,
but also one to reorder it when previously exclusive index terms in one group
are used together in one document description. This process of changing terms
from one column to another to maintain exclusiveness is called renaming.

It is evident that the minimum possible number of attribute groups is
at least as great as the largest number of index terms used in a single
document description. The actually realizable minimum may be considerably
larger, and most likely is. In the DDC sample, the maximum number of terms in
any one document is 21.
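
The grouping problem can be made concrete with a simple first-fit sketch. It
is not the algorithm used in the study; it is an assumed, generic procedure
that places each term in the first group containing no term with which it has
ever been used, opening a new group when none qualifies. Renaming is
approximated here by simply recomputing the assignment. All names are
hypothetical.

```python
def assign_groups(descriptions):
    """First-fit assignment of index terms to mutually exclusive groups.

    descriptions: iterable of term collections, one per document.
    Returns a list of sets; no two terms in a set share a document.
    """
    # co_used[t] = every term that appears with t in at least one description
    co_used = {}
    for description in descriptions:
        terms = set(description)
        for term in terms:
            co_used.setdefault(term, set()).update(terms - {term})

    groups = []
    # Place the most heavily associated terms first; ties broken by name.
    for term in sorted(co_used, key=lambda t: (-len(co_used[t]), t)):
        for group in groups:
            if not (group & co_used[term]):
                group.add(term)
                break
        else:
            groups.append({term})
    return groups

def lower_bound(descriptions):
    """Largest number of distinct terms in any single description."""
    return max((len(set(d)) for d in descriptions), default=0)
```

The number of groups produced can never fall below lower_bound, and a
first-fit procedure of this kind gives no assurance of reaching it.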

In applying the Multi-List System to the total DDC document file, it
was believed initially that the 6,000-odd descriptors (DDC index terms) could
be arranged into about 30 mutually exclusive attribute groups or columns
(subsequently raised to 40), each containing about 200 descriptors. In each
column descriptors are grouped into ten sets of about 20 each, one set from
each of three columns comprising a superkey covering a range of descriptor
code-value combinations. Each group of three columns has 1,000 of these
superkeys serving to define lists, or a total of 10,000 lists traversing the
documents stored in the Multi-Association Area (the file of document numbers,
descriptors and chain addresses).

To examine the feasibility of the Multi-List System, a computer
algorithm was developed to set up the array of mutually exclusive attribute
groups. A UNIVAC I-II program was written and run against the collection of
38,402 DDC documents available for testing and analysis.

3. Descriptions and Results of Experiments Using DDC Data.

The  methodology  followed  and  results  of  the  computer  experiment  are 
summarized  in  the  following  paragraphs.  More  detailed  descriptions  have 
been  reported  previously  in  [3]-[7]. 

This phase of the study sets up the mutually exclusive attribute group
array, using descriptor relationships in the DDC data as basis for the
allocations, and is designed to answer two basic questions:


What  is  the  achievable  minimum  number  of  groups  into  which  the 
descriptors  can  be  assigned? 


How  complex  a  renaming  process  is  required  to  retain  a  minimum 
number  of  groups  as  the  introduction  of  new  associations 
necessitates  reordering  of  the  array? 


a. Notation. The notation used in these analyses is modified slightly
from  that  in  published  literature  on  the  Multi-List  System.  Symbols  used  are: 


$D$ -- The description used to describe a document and consisting
of a number of descriptors, denoted by either $d_a$ or $d_{i,j}$.

$d_a$ -- The descriptors in $D$, $1 \le a \le m$. Used when location within
the attribute groups is not pertinent or is not known.

$c_i$ -- A column or group in the array of mutually exclusive
attribute groups. $1 \le i \le f$, where $c_f$ is the last group.

$d_{i,j}$ -- The jth descriptor in the ith column; e.g., $d_{7,12}$ is the
12th descriptor in group 7.

$i,j_n$ -- Alternate form for writing $d_{i,j}$, particularly when it is
necessary to differentiate two j's in one column.

$i$-$v$ -- Notation used to denote a descriptor in $c_i$, retaining the
code itself or its frequency of usage rank. Thus, 11-3860
represents descriptor code 3860 in group 11 and 01-23, the
23rd-ranked descriptor (in the total file) in group 01. Each
group normally is sequenced on $v$, but only the $v$'s in the
group are included.

[symbol illegible] -- The set of descriptors in a column $c_i$; it is the
$d_{i,j}$, $j = 1, 2, \ldots, n$.

$D_{i,j}$ -- The set of all descriptors with which a $d_{i,j}$ is associated in
use. By definition, none of them can be in $c_i$.

$D_i$ -- The set of all descriptors associated in use with any one or
more of the $d_{i,j}$ in $c_i$. It is the logical sum of all the $D_{i,j}$
in a $c_i$.

$p$ -- The number of groups not having a descriptor in $D$.

$c_{k_m}$ -- An individual group or column not having a descriptor in $D$.
$1 \le m \le p \le f$.

List K -- The $c_{k_m}$ with the descriptors included in each.

$a \cap b$ -- "a is inclusive with (used with) b." a and b can vary in form
and may differ. Thus, $i,j_1 \cap i,j_2$ (or the same pair written in
$i$-$v$ form) means that a single specified descriptor pair is
inclusive; $i,j_1 \cap c_k$ means that $i,j_1$ is inclusive with one or
more of the $d_{k,j}$ without specifying which one(s).

$a \cup b$ -- "a is exclusive to (not used with) b." a and b can vary as
above.

b. Definition of Renaming. Assume that the mutually exclusive attribute
groups have been established; no two descriptors in any one column are used
together in a document. Now if a new description $D$ contains two previously
exclusive descriptors $i,j_1$ and $i,j_2$, it is necessary to move one of them
into another column. The Multi-List System proposes these types of renamings:

First-Order Renaming. If there is a column $c_k$ such that $i,j_1 \cup c_k$,
then $i,j_1$ can be moved to $c_k$, becoming $k,j_n$. Conditions for
exclusiveness are now met by $c_i$. Similarly, exclusiveness may be
maintained by moving $i,j_2$ instead.

Second-Order Renaming. There may be no $c_k$ into which $i,j_1$ (or $i,j_2$)
can be shifted. If, however, a descriptor $k,j_x$ can be shifted into
still another column, thereby making $i,j_1 \cup c_k$, then a double shift
will maintain the exclusiveness, i.e., $k,j_x \to c_t$ ($k,j_x \cup c_t$) and
$i,j_1$ (or $i,j_2$) $\to c_k$.


nth-Order  Renaming..  The  above  process  can  be  repeated  any  number  of 
times.  Without  specifying  a  value,  the  Multi-List  System  recognizes 
that  an  upper  limit  to  orders  of  renaming  should  be  set.  If  the  limit 
is  reached  without  a  successful  renaming,  it  is  concluded  that  the 
input  descriptor  is  inclusive  to  every  column  and  consequently  can  be 
placed  in  none  of  them.  At  this  point,  either  the  number  of  attribute 
groups  is  increased  by  one  or  recourse  is  made  to  a  human  monitor. 
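The first- and second-order renamings defined above can be illustrated with a minimal Python sketch. It is not the flow chart of the Multi-List System or the study's program; the set-based representation, the function names (`is_exclusive`, `first_order`, `second_order`) and the rule that only a column with a single blocking descriptor is considered are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Assoc = Dict[str, Set[str]]  # descriptor -> descriptors it is used with


def is_exclusive(d: str, column: Set[str], assoc: Assoc) -> bool:
    """True when d is used with no other descriptor in the column."""
    return not ((column - {d}) & assoc.get(d, set()))


def first_order(columns: List[Set[str]], assoc: Assoc, i: int, d: str) -> Optional[int]:
    """Move d from column i to the first other column it is exclusive to."""
    for k, col in enumerate(columns):
        if k != i and is_exclusive(d, col, assoc):
            columns[i].discard(d)
            col.add(d)
            return k
    return None  # no such column: a higher-order renaming is needed


def second_order(columns: List[Set[str]], assoc: Assoc, i: int,
                 d: str) -> Optional[Tuple[int, str, int]]:
    """Shift the single blocker x of column k to a third column t, then move d to k."""
    for k, col in enumerate(columns):
        if k == i:
            continue
        blockers = col & assoc.get(d, set())
        if len(blockers) != 1:  # assumed restriction: exactly one blocker
            continue
        x = next(iter(blockers))
        for t, other in enumerate(columns):
            if t not in (i, k) and is_exclusive(x, other, assoc):
                col.discard(x)
                other.add(x)           # k,j_x -> c_t
                columns[i].discard(d)
                col.add(d)             # i,j_1 -> c_k
                return k, x, t
    return None


# Hypothetical data: A and B are newly used together, so column 0 is in conflict.
assoc = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"},
         "C": {"A", "B"}, "D": {"A", "B"}}
columns = [{"A", "B"}, {"C"}, {"D"}]
print(first_order(columns, assoc, 0, "A"))   # None
print(second_order(columns, assoc, 0, "A"))  # (1, 'C', 2)
```

In this toy case neither conflicting descriptor has a column to which it is exclusive, so first-order renaming fails, while shifting C out of column 1 makes room for A.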

The basic flow chart for the logical operations required for first and
second order renamings is shown in Figure 1. Except for slight changes in
notation, this is identical with those presented on pp. 67-68, Part I,
Volume I of reference [1]. The chart begins at the point where existence
of a conflict within a group has become known. In effect, it includes the
basic logical operations required for all orders of renaming; those of third
and higher order can be considered as successive applications of the second
order renaming process.

During  the  course  of  the  study  it  became  apparent  that  another  type  of 
renaming  was  not  only  possible,  but  also  was  necessary  to  maintain  the  number 
of  groups  at  a  minimum.  This  is  defined  thus: 


Second-Degree Renaming. The requisite $k,j_x$ may not exist for successful
second-order renaming. However, if there exists also a $k,j_y$ such that
(1) $D_k - (D_{k,j_x} + D_{k,j_y})$ is exclusive to $i,j_1$; (2) $k,j_x$ is
exclusive to some $D_m$; and (3) $k,j_y$ is exclusive to some $D_p$ (where
$p$ may equal $m$); then the triple movement $k,j_x \to c_m$, $k,j_y \to c_p$
and $i,j_1 \to c_k$ restores exclusiveness.

nth-Degree Renaming. This follows the above principle, except that n
descriptors are moved out of $c_k$.

In theory, any degree of renaming can occur at any stage of an nth-order
renaming.


c. Algorithm Requirements and Practical Limitations on the Computer
Experiment. Although Figure 1 is complete for the basic logic of what must
be done in descriptor renaming, it is not a computer solution or flow chart
of how it is to be accomplished. For purposes of this study, it was first
necessary to devise a detailed method which was feasible to operate, in
terms of time and cost, on the UNIVAC I-II magnetic-tape processors available
for use.

[Figure 1. Multi-List System: Logical Flow Charts for Renaming Process
to Maintain Attribute Group Exclusiveness. Flow chart not reproduced.
Recoverable labels: Renaming Successful; Second Order Renaming;
Unsuccessful: to third order renaming, manual intervention, or addition of
another column to attribute groups.]


The study objectives required methods (1) for the initial establishment
of the array of attribute groups based upon actual usage of descriptors in
a reasonably large "base" document file; and (2) for maintaining the array
as new documents are added. Preferably, the same basic machine program
should take care of both. The specifications listed below must be met;
some of them are framed to reflect the particular characteristics of a tape
processing system.

A file of attribute groups, $c_i$, and the descriptors within each
must be maintained. Within each $c_i$, the $d_{i,j}$ are ordered in some
systematic manner, e.g., in descriptor code or frequency-of-usage
sequence.

A cross reference between descriptor name or code and its attribute
group must be set up and kept current.

The descriptor set $D_{i,j}$ for each $d_{i,j}$ must be established and kept
current. So long as documents are not removed from the file, the
maintenance procedure need provide only for $D_{i,j} + d_{k,r} \to D_{i,j}$
when a new association $d_{i,j} \cap d_{k,r}$ is introduced.

The descriptor set $D_i$ must be established and kept current for each $c_i$.
The maintenance procedure must provide for both $D_i + D_{i,j} \to D_i$ and
$D_i - D_{i,j} \to D_i$ to reflect the effects of the movement (renaming) of a
$d_{i,j}$ into or out of $c_i$.


Computer and cost considerations made it evident that the entire 5540
descriptors in the DDC sample could not be handled. Accordingly, the 599
most frequently used were selected; this includes all descriptors with 72
or more occurrences in the sample file. This choice permitted setting up
the descriptor sets $D_i$ and $D_{i,j}$ as 2-block records (UNIVAC I-II blocks
of 60 words each) in matrix form, with 2-character fields in each record
corresponding positionally to the descriptors 001-599 taken in rank-number
sequence. The number of co-occurrences of a $d_{i,j}$ with each of the other
598 descriptors can be accumulated readily as 2-digit numbers in the proper
positional field (very few pairs occur more than 100 times). The 600th field
identifies the descriptor or attribute group to which the record pertains.
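The positional record layout described above can be mimicked in a short Python sketch: one row per descriptor rank, one count per other rank. The function names and the choice to saturate counts at 99 are assumptions for illustration; the report does not say how a pair exceeding the two-digit field was handled.

```python
from itertools import combinations


def build_records(documents, n_ranks, cap=99):
    """records[r][s] = number of documents using ranks r and s together (1-based)."""
    records = [[0] * (n_ranks + 1) for _ in range(n_ranks + 1)]
    for ranks in documents:
        # Keep only descriptors inside the selected rank range.
        kept = sorted({r for r in ranks if 1 <= r <= n_ranks})
        for r, s in combinations(kept, 2):
            count = min(cap, records[r][s] + 1)  # assumed: saturate at two digits
            records[r][s] = records[s][r] = count
    return records


def associated(records, r):
    """The association set for rank r: every rank with a nonzero count."""
    return {s for s, count in enumerate(records[r]) if count > 0}


# Hypothetical documents, each a list of descriptor rank numbers.
docs = [[1, 2, 3], [2, 3], [3, 1, 7]]   # rank 7 is outside n_ranks=5 and is dropped
records = build_records(docs, n_ranks=5)
print(records[2][3])           # 2
print(associated(records, 1))  # {2, 3}
```

Keeping counts rather than a simple yes/no flag is what allows an association to be identified by its number of occurrences, as in the later analyses restricted to pairs with five or more co-occurrences.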


Tape-handling considerations also made it clear that computer renaming
would have to be restricted to avoid excessive "tape spinning." Thus,
renaming was limited to the forward direction of tape movement, equivalent to
moving a descriptor only into columns to the right of the one from which it
must be moved.


The investigation of aspects not taken care of by the computer program
was covered by selecting the 50 most common descriptors and, within them,
taking only pairs with five or more co-occurrences. This reduced the amount
of data to a volume permitting manual simulation of the algorithm. Some
manual simulation also was performed with the 100 most common descriptors.

In all of these studies, descriptors are identified by their rank
number based upon frequency of usage in the file; 001 is most frequent and
599, least. A complete list of the 599 descriptors is contained in Table
A-1 (Appendix A), in rank number sequence. Each descriptor shows the number
of different pairs it forms with the other 598 and also the ASTIA field and
group to which it was assigned in 1960-1961 (the period during which most of
the documents in the sample were described; it should be noted that field/
group assignments have been changed since that time).


d. Summary Statistics of Pair-Associations Among 599 Most Common DDC
Descriptors. These descriptors have 49,306 pair-combinations (twice as many
permutations) with 248,425 total occurrences. These constitute 23.6% of the
different pairs in the total file and 46.8% of all pair occurrences. On the
average, each descriptor is used with 165 of the other 598; the range is from
49 to 579. Table A-2 (Appendix A) summarizes these pair-associations by
number of occurrences.

As  might  be  expected,  descriptors  used  very  frequently  are  highly  likely 
to  co-occur  in  document  descriptions.  Chart  1  depicts  the  pair-associations 
among  the  100  most  common;  the  lower  left  triangular  matrix  is  a  graphic 
portrayal  of  this,  with  the  actual  number  of  co-occurrences  of  each  pair  in 
the  upper  right  portion.  Some  breakdowns  of  possible  and  actually  existing 
pairs  are  summarized  in  this  table: 


                                  Possible      Combinations   Percentage
Association Type                  Combinations  Occurring      Occurring

Associations within
  Ranks 1-50                         1,225         1,088          88.8%
Associations of Ranks 1-50
  with Ranks 51-100                  2,500         1,928          77.1
Associations within
  Ranks 51-100                       1,225           681          55.6
Associations, one member
  in Ranks 101-599                 174,151        45,609          26.2
All Associations,
  Ranks 1-599                      179,101        49,306          27.5%

Although over a fourth of the possible pairs actually exist in the
sample, it should be noted that most of them do not occur frequently. In
fact, over 40% occur only once and almost 60% only once or twice (see Table
A-2).
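The counts of possible combinations and the summary averages quoted in this subsection follow from simple combinatorics. The Python sketch below recomputes them from the figures given in the text; the variable names are illustrative only.

```python
from math import comb

pairs_occurring = 49_306      # different pairs among the 599 descriptors
total_occurrences = 248_425   # total occurrences of those pairs

possible_all = comb(599, 2)                     # all pairs among ranks 1-599
possible_within_50 = comb(50, 2)                # pairs within a block of 50 ranks
possible_across = 50 * 50                       # ranks 1-50 with ranks 51-100
possible_rest = possible_all - comb(100, 2)     # at least one member in ranks 101-599

share_all = pairs_occurring / possible_all      # fraction of possible pairs occurring
mean_partners = 2 * pairs_occurring / 599       # average partners per descriptor
mean_occurrences = total_occurrences / pairs_occurring

print(possible_all, possible_within_50, possible_across, possible_rest)
# 179101 1225 2500 174151
print(round(100 * share_all, 1), round(mean_partners))
# 27.5 165
```

Each pair involves two descriptors, which is why the average number of partners per descriptor is twice the number of pairs divided by 599.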
e. Automatic Attribute Group Assignment of 599 Most Common DDC
Descriptors. Consider first the initial establishment of the array of
mutually exclusive attribute groups, which presupposes that a file of
document descriptions exists. Without loss of generality, it can be postulated
that all pair associations are formed prior to assigning any descriptor to
a column in the array. This is equivalent to forming the record $D_{i,j}$ of
associations for each descriptor and then beginning the assignment. This is
the method followed in the computer program.


The program assigns the first descriptor, Rank 1, to attribute group
(column) $c_1$ and sets up $D_1$, which at this point is the same as $D_{1,1}$.
The next descriptor, Rank 2, is taken and $D_1$ examined to determine whether
it contains the descriptor. If it does, the descriptor is not exclusive to
$c_1$ and is placed in $c_2$, with $D_2$ being created.

In the general case, the next descriptor $d_a$ is taken and each $D_i$,
beginning with $D_1$, is examined to see whether or not it contains $d_a$. If it
does (meaning that $d_a$ forms a pair with one or more of the descriptors
already in $c_i$), the $D_i$ for the next column is examined. This process
continues until one of two conditions terminates the cycle: (1) a column is
found in which $d_a$ does not appear in $D_i$, in which case $d_a$ is added to
that column (becoming $d_{i,j}$) and $D_i$ is updated by logical
superimposition of the descriptor's $D_{i,j}$; or (2) $d_a$ is not exclusive to
any column so far formed, in which case it becomes the first member $d_{m,1}$
of a new column $c_m$ and its $D_{m,1}$ becomes the new $D_m$.

This technique assures that each descriptor is assigned to the leftmost
(lowest-numbered) column $c_i$ in which it is exclusive (i.e., is not used
with any descriptor in $c_i$). The most time-consuming part of the computer
operation is the handling of the tape containing the $D_i$ records. Because
each descriptor added to a column means superimposing its $D_{i,j}$ on the
existing $D_i$, this tape must be rewound and rewritten for each descriptor
processed. The tape containing the $D_{i,j}$ records is set up in rank number
sequence and read only once during execution of the machine program.
Allocating 599 descriptors required 3.75 hours on the UNIVAC II.
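The assignment procedure just described is a first-fit allocation, and can be sketched in a few lines of Python. This is an in-memory illustration, not the UNIVAC tape program; the function name `assign_groups` and the dictionary-of-sets input are assumptions.

```python
def assign_groups(ranked, assoc):
    """First-fit assignment of descriptors (in rank order) to exclusive columns.

    ranked -- descriptors in frequency-of-usage order
    assoc  -- maps each descriptor to the set of descriptors it is used with
    """
    columns = []     # the columns c_i, each a list of descriptors
    col_assoc = []   # for each column, the union of its members' associations
    for d in ranked:
        for members, used in zip(columns, col_assoc):
            if d not in used:                    # d is exclusive to this column
                members.append(d)
                used.update(assoc.get(d, set())) # superimpose d's associations
                break
        else:                                    # inclusive with every column so far
            columns.append([d])
            col_assoc.append(set(assoc.get(d, set())))
    return columns


# Hypothetical pairs: (1,2), (1,3), (2,3), (1,4), (2,5), (4,5)
assoc = {1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2}, 4: {1, 5}, 5: {2, 4}}
print(assign_groups([1, 2, 3, 4, 5], assoc))  # [[1, 5], [2, 4], [3]]
```

Because each descriptor goes into the first column that will take it, the result depends on the processing order, and nothing in the procedure guarantees that the number of columns is the smallest possible.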


This algorithm assigned the 599 descriptors to 56 attribute groups,
Table 1, each containing from 1 to 16 entries. It is not known whether or not
this is a minimum. The technique assures that each descriptor is placed in
the first possible column and, therefore, is used (forms a pair) with at
least one other descriptor in every lower-numbered column. However, it is
possible that movement of a descriptor from $c_i$ into a higher-numbered
column might eliminate some associations from $D_i$ and thereby permit
transferring other descriptors down into $c_i$. The most obvious candidate is
the lone entry in column 56; by juggling entries in some of the other
columns, it might be fitted into one of the 55 left. If such a solution
exists, it has not been found. The sheer volume of data involved precludes
manual analysis and the thousands of possible rejugglings make a computer
trial impracticable from a cost standpoint.

This 56-column array is almost twice as large as the 30 originally
considered probable. Because the very frequently occurring descriptors form
many different pairs (see Chart 1 and Table A-1, Appendix A) and are rather
broad in meaning, the need for assigning them into the attribute groups has
been questioned. It may be more appropriate to make a separate list for
each one of them and to restrict the lists formed by combining descriptor
ranges in each of three columns to the attribute groups created from the
less frequently used descriptors.

Choice  of  a  cutoff  point  for  this  variation  (which  is  not  part  of  the 
original  Multi-List  System)  is  arbitrary.  A  new  attribute  group  array  was 
created  after  eliminating  the  19  most  common  descriptors,  all  of  which  had 
868 or more occurrences. The 580 assigned are all of the descriptors used in
from 72 to 846 document descriptions. The resultant array, Table 2, contains
46  columns.  Although  smaller  than  the  first,  it  still  is  higher  than  the 
30  thought  possible. 

The development of these two arrays is equivalent to their initial
establishment and provides no information on the effects of using the same
algorithm for a file-updating type of operation. Once initially established,
the attribute group array requires updating (adjustment) as new documents
add new descriptor pair associations to those already existing. Most of
these involve descriptors in different columns and do not destroy the mutual
exclusiveness within each. However, some new pairs involve descriptors in
the same column and one of them must be moved, or renamed, to maintain
exclusiveness. The UNIVAC II computer algorithm does not accomplish this
renaming operation. Although the basic approach for modifying it to
accomplish first-order renaming is relatively straightforward, actual computer
time to run the program on even the fairly small set of 599 descriptors is
excessive.

Nonetheless, it was considered pertinent to obtain some idea of the
percentage of conflicts which result from adding new documents to an
already-established file. For this purpose, the basic file was re-created after
eliminating all pair associations occurring only in a random 10% sample
(all document numbers ending in "9") of the 38,402-document file. The
algorithm then was rerun using the pairs remaining and the resulting attribute
group array checked to determine how many conflicts would occur by
introducing the associations found only in the 10% sample. (In effect, this
considers the sample as constituting input to an updating cycle.)
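The hold-out experiment described above can be outlined in Python as follows. The sketch assumes each document is available as a set of descriptor pairs keyed by document number; the names `split_pairs` and `count_conflicts` are illustrative and do not come from the report.

```python
def split_pairs(doc_pairs, is_held_out):
    """Return (pairs in the base file, pairs found only in held-out documents)."""
    base, held = set(), set()
    for doc_id, pairs in doc_pairs.items():
        (held if is_held_out(doc_id) else base).update(pairs)
    return base, held - base


def count_conflicts(columns, new_pairs):
    """Count new pairs whose two descriptors already share a column."""
    column_of = {d: i for i, members in enumerate(columns) for d in members}
    return sum(
        1
        for a, b in new_pairs
        if column_of.get(a) is not None and column_of.get(a) == column_of.get(b)
    )


# Hypothetical documents: number -> set of descriptor pairs (as sorted tuples).
doc_pairs = {
    101: {(1, 2)},
    109: {(1, 2), (3, 4)},
    112: {(2, 3)},
    119: {(1, 4)},
}
base, only_new = split_pairs(doc_pairs, lambda doc_id: doc_id % 10 == 9)
columns = [[1, 3, 4], [2]]                 # an array built from the base pairs alone
print(sorted(only_new))                    # [(1, 4), (3, 4)]
print(count_conflicts(columns, only_new))  # 2
```

A pair that also occurs in a base document is not "new", which is why the held-out set is reduced by the base set before conflicts are counted.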

The "9's" sample comprises 3,828 documents and includes just under
13,800 of the 49,306 different pair combinations in the full file. Of these,
2,062 (about 15%) are unique to the "9's" documents; none occur in more than
three documents. Occurrences in the full file and the sample are summarized:

No. of Pair     Different Pairs    Different Pairs
Occurrences     in Full File       Found Only in "9's"

     1              20,218              2,001
     2               8,654                 59
     3               4,854                  2

[Table 2. Mutually Exclusive Attribute Group Assignment of the 20th-599th
Most Common DDC Descriptors (descriptors denoted by frequency-of-usage rank
number). Table body not reproduced.]


Using the 47,244 pair combinations in the 90% of the file, the algorithm
assigned the 599 descriptors into 55 mutually exclusive attribute groups,
Table 3, each with from 2 to 16 descriptors. This is one less than the 56
columns for the full file, Table 1. However, the dispersion of descriptors
in Table 3 is markedly different. The first 28 columns have significantly
more descriptors (369 against 351). Seven groups, compared with four in
Table 1, have six or fewer descriptors each and eleven have either 15 or 16,
against only three. The assignment is the same for the first 63 descriptors,
differences beginning with rank number 64, but only Groups 1 and 3 in the
final arrays are identical. However, although the two arrays are quite
dissimilar for a difference of less than 5% in the number of pair associations
included (47,244 and 49,306), it appears impossible to draw any meaningful
conclusions from this fact. The basic files in both cases are large (34,500
documents or more) and the form of the final arrays is more apt to depend upon
chance variations in the particular pairs present than upon some meaningful
factor.

The 2,062 pair combinations unique to the "9's" sample create 60 conflicts
with the descriptor assignments of Table 3; the first is the pair 2-174 in
Column 2. The conflicts comprise about 3% of the new pairs and occur at the
rate of one in about 65 new documents. Two new documents, in turn, generate
slightly more than one new pair among the 599 most common descriptors. (It
should be noted that some documents, possibly as many as 25%, do not have two
descriptors among these 599 and create no pair entering into the algorithm.)
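The rates quoted in this paragraph can be checked directly from the counts given for the "9's" sample. The short Python calculation below uses only figures stated in the text; the variable names are illustrative.

```python
docs_in_sample = 3_828   # documents whose numbers end in 9
new_pairs = 2_062        # pair combinations found only in those documents
conflicts = 60           # new pairs falling within a single column
pairs_full_file = 49_306

pairs_in_base = pairs_full_file - new_pairs               # pairs left in the 90% file
conflict_share = conflicts / new_pairs                    # about 3% of new pairs
docs_per_conflict = docs_in_sample / conflicts            # about one in 64-65 documents
new_pairs_per_two_docs = 2 * new_pairs / docs_in_sample   # slightly more than one

print(pairs_in_base)                      # 47244
print(round(100 * conflict_share, 1))     # 2.9
print(round(docs_per_conflict, 1))        # 63.8
print(round(new_pairs_per_two_docs, 2))   # 1.08
```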

All  of  these  conflicts  must  be  resolved  by  the  updating  algorithm.  In 
an  attempt  to  ascertain  some  of  the  results  of  this  renaming  operation,  the 
adjustments  have  been  traced  through  partially  on  a  manual  basis. 

The  simplest  renaming  is  first  order--moving  one  of  the  two  conflicting 
descriptors  into  a  column  in  which  it  is  exclusive.  This  must  be  done  in 
some  prescribed  order — e.g.,  by  columns  from  low  to  high,  in  ascending 
sequence  on  descriptor  code  number  of  the  pairs,  in  sequence  on  their  rank 
numbers,  etc.  Results  differ  depending  upon  the  order  of  resolution  and  also 
upon  how  much  of  available  knowledge  is  used  at  the  time  a  particular  conflict 
is  resolved.  If,  for  example,  all  new  pair  associations  are  posted  before 
renaming  begins,  then  only  the  60  conflicts  must  be  clarified  and  the  new 
assignment  is  final  for  the  cycle.  On  the  other  hand,  if  new  associations 
are  accessed  in  the  prescribed  order  and  conflicts  resolved  as  they  occur, 
then  a  renaming  subsequently  may  give  rise  to  another  conflict  above  the  60 
already  known.  Both  methods  have  been  carried  out  far  enough  to  demonstrate 
that  the  resultant  array  has  at  least  57  columns.  It  possibly  has  more, 
because  this  point  has  been  reached  before  half  of  the  known  conflicts  have 
been  resolved.  From  Table  1,  it  is  known  that  a  56-column  array  is  possible. 
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* Descriptors  denoted  by  frequency-of-usage  rank  number. 


Consequently, it again is concluded that an updating algorithm limited to
first-order renaming does not maintain a minimum array of attribute groups.

Because of the large number of possible movements which must be examined,
no attempt has been made to resolve conflicts with higher-order renaming
algorithms.


f. Summary of Manual Analyses of Multi-List System. The nonmachine
studies  of  necessity  have  been  confined  to  a  limited  number  of  descriptors 
with  a  data  volume  small  enough  to  permit  human  simulation  of  computer 
processes.  Their  purpose  has  been  (1)  to  examine  the  effects  of  alternative 
choices  of  action  in  the  renaming  operation  and  (2)  to  determine  the  degree 
of  complexity  of  renaming  needed  to  maintain  the  minimum  number  of  attribute 
groups.  The  results,  reported  upon  in  detail  in  [4]-[6],  are  summarized 
briefly  here,  but  the  attribute  group  arrays  are  included. 

The first trial used the 50 most common descriptors and pairs among them
with five or more occurrences. The associations are shown in Chart 2. By
inspection, it can be seen that most of the first 19 descriptors (all except
12 and 16) are used with each other and therefore must be in separate columns.
Each of the remaining ones is placed initially in the first (lowest numbered)
column to which it can be assigned based solely upon lack of association with
the one descriptor at the head of the column. This initial assignment is shown
at the top of Chart 3. It utilizes only the pair associations formed by the
17 descriptors in the first line of the array.

Descriptors  20-50  are  then  processed  in  sequence,  each  one  adding  the 
new  associations  formed  by  it  with  the  remaining  descriptors;  e.g.,  processing 
descriptor  (rank  number)  20  adds  its  associations  with  21-50,  etc.  Some  of 
these  newly  introduced  associations  cause  conflicts  which  require  that  one 
of  the  descriptors  be  moved  into  a  new  column.  This  is  done  using  only 
associations  so  far  known  to  determine  exclusiveness  and  the  assignment  may 
be  changed  subsequently  as  the  remaining  descriptors  are  processed  in  turn. 

It  will  be  noted  that  this  process  corresponds  to  a  file-updating  operation, 
with  renaming  limited  to  higher  numbered  columns. 


There are four variations in the sequence of processing steps when a
new pair $d_{a,b}$, $d_{e,f}$ is introduced and causes a conflict (i.e., both
are in the same column and one must be moved). Each variation results in a
different final array:

Variation 1: Check $d_{a,b}$ before $d_{e,f}$; move $d_{e,f}$ to a new group.
Variation 2: Check $d_{a,b}$ before $d_{e,f}$; move $d_{a,b}$ to a new group.
Variation 3: Check $d_{e,f}$ before $d_{a,b}$; move $d_{a,b}$ to a new group.
Variation 4: Check $d_{e,f}$ before $d_{a,b}$; move $d_{e,f}$ to a new group.

[Chart 2. Pair Associations Occurring Five Times or More Among the 50 Most
Frequently Used ASTIA Descriptors. Chart not reproduced.]

Observe that the solutions differ only when both descriptors are exclusive
to the first group in which either is exclusive, or when both are inclusive
to all established groups. Once an initial difference of action has taken
place, many of the subsequent renamings, of course, are different, although
some are still identical. It is to be expected that the four solutions will
all differ from each other.


Results of processing each of the four variations are shown in Chart 3,
22 or 23 groups being required for the 50 descriptors. No conclusions can
be drawn as to which variation is preferable; the fact that two of them
require one more column may be due only to the particular data in the
experiment and cannot be used to conclude that they are of necessity less
preferable. The different results do indicate that it may be extremely
difficult to select a sequence of operations which will assure a minimum
number of attribute groups.

Whether or not the number of columns (22) is actually minimum is
unknown. It has been proved that, with these data, at least 21 are required.
However, attempts to reduce the array to 21 columns have been unsuccessful
and, similarly, no proof has been developed to show that 22 are necessary.
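The report does not state how the lower bound of 21 columns was proved. One common way to obtain a bound of this kind, shown here only as an illustration, is to exhibit a set of descriptors that are all used with each other: every member of such a set needs its own column. The Python sketch below builds such a set greedily; the function names and data are hypothetical, and a greedy pass does not necessarily find the largest such set.

```python
from itertools import combinations


def is_mutually_used(descriptors, assoc):
    """True when every pair of the given descriptors is used together."""
    return all(b in assoc.get(a, set()) for a, b in combinations(descriptors, 2))


def greedy_mutual_set(ranked, assoc):
    """Grow a set of mutually used descriptors by one pass in rank order."""
    chosen = []
    for d in ranked:
        if all(d in assoc.get(c, set()) for c in chosen):
            chosen.append(d)
    return chosen


# Hypothetical pairs: (1,2), (1,3), (2,3), (1,4), (2,5), (4,5)
assoc = {1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2}, 4: {1, 5}, 5: {2, 4}}
core = greedy_mutual_set([1, 2, 3, 4, 5], assoc)
print(core)                           # [1, 2, 3]
print(is_mutually_used(core, assoc))  # True
print(len(core))                      # 3, so at least 3 columns are required
```

The size of any such set is only a lower bound; the true minimum number of columns may be larger, which matches the situation reported here, where 21 is proved necessary but 22 is the best array found.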

A series of manual simulations, identical in approach to the foregoing,
then was performed on the 100 most common descriptors, using all pair
associations existing among them. (The associations are shown in Chart 1.)
Results are presented in Chart 4, with from 39 to 42 columns being required,
depending upon the variation chosen. The array of 4E is a further
modification of the procedure creating 4B and is included to illustrate the
effects of retracting renamings which subsequent actions show to be
unnecessary. Thus, if $d_{a,b}$, $d_{e,f}$ conflict and $d_{e,f}$ is moved, a
pair association introduced at a later time may result also in moving
$d_{a,b}$ into a new column. But this may make it possible to restore
$d_{e,f}$ to the original column. This procedure was followed in creating the
array of Chart 4E. The array of 4F follows the logic of setting up the
attribute groups initially, using all pair associations in the data to guide
the assignment.

The arrays generated with the sets of both 50 and 100 descriptors have
been set up using first-order renaming only. All have been reviewed for
reduction in number of columns through more complex renamings and, mostly by
chance, it has been shown that the arrays 4B, 4C and 4D can be cut by one
column. That reducing 4C to 38 columns is most sophisticated, involving
second-order, third-degree renaming. It is this reduction that introduces the
concept of nth-degree renaming to the nth-order renaming initially proposed
for the Multi-List System.

The  results  of  these  simulations  bring  out  several  significant  factors 
pertaining  to  the  establishment  and  maintenance  of  descriptors  in  a  minimum 
number  of  mutually  exclusive  attribute  groups. 


First, file storage requirements for holding descriptor pair associations
are large. A record must be maintained for each descriptor showing every
other descriptor with which it is used in a document description. In the
case of the DDC sample, this auxiliary file is more than twice as large as
the basic document/descriptor file itself. Furthermore, most pairs occur
only once or twice and new pairs are added at a fairly high rate as new
documents are introduced. Other libraries may have characteristics different
from those of DDC and, presumably, at some stage of library size, the number
of new associations becomes rather small. It is not known at what point this
occurs.

[Chart 3. Final Arrays Resulting From Four Variations of First-Order
Renaming (50 Most Common DDC Descriptors, Pair Associations With 5 or More
Occurrences). Chart not reproduced.]

Second,  if  both  of  two  conflicting  descriptors  meet  the  algorithm 
specifications  for  renaming,  then  the  one  selected  apparently  has  an  effect 
on  the  number  of  groups.  Results  of  the  analyses  do  not  indicate  which,  if 
any,  of  them  maintains  the  minimum  number  of  columns.  Indeed,  it  is  possible 
that  each  of  the  several  variations  should  be  followed  through  to  completion 
to determine which is best. Even a few conflicts give a large number of
possible combinations of actions.

Third,  a  fairly  complex  renaming  process  is  necessary  to  achieve  a 
minimum  array.  It  has  been  shown  that  first-order  renaming  does  not  do  this. 
The  analysis  indicates  that  renamings  of  higher  order,  and  also  of  higher 
degrees,  are  required,  but  how  high  has  not  been  established.  In  any  case, 
the  number  of  possible  actions  to  be  investigated  rapidly  becomes  so  large 
as  to  pose  an  almost  prohibitive  processing  workload.  Even  second-order 
renaming — the  simplest  after  first  order — involves,  with  N  descriptors, 
looking  at  close  to  N(N  -  1)  possibilities.  It  appears,  but  has  not  been 
proved,  that  even  more  complex  renaming  is  needed  to  maintain  a  minimum 
array. 

Finally,  resolution  of  conflicts  using  the  algorithm  of  Figure  I  does 
not  necessarily  maintain  a  minimum  number  of  groups.  In  this  algorithm,  a 
renaming  process  is  initiated  only  if  a  new  association  occurs  for  two 
descriptors,  previously  mutually  exclusive,  in  the  same  column,  and  one  of 
them  at  least  is  involved  in  the  resultant  renaming.  It  is  perfectly 
possible,  however  (see  Chart  4E),  to  reduce  the  array  after  the  updating 
cycle  by  movement  of  one  or  more  descriptors  for  which  no  new  associations 
have  been  introduced;  indeed,  there  may  be  no  new  pairs  in  that  column.  It 
may  well  be  that  the  complete  array  should  be  reanalyzed  after  each  updating 
cycle  to  be  certain  that  it  is  the  minimum  achievable.  Otherwise,  it  may 
continue  to  expand  over  a  period  of  time  until  it  contains  several  more  than 
the  minimum  number  of  columns. 

No  attempt  has  been  made  toward  determining  the  approximate  complexity 
of  renaming  required  for  a  minimum  number  of  attribute  groups.  The  array 
already  is  so  large  that  a  reduction  of  even  3-4  columns  does  not  negate 
the  conclusions  of  the  next  subsection. 
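
The grouping problem discussed above can be illustrated with a short sketch. It is not the renaming algorithm of Figure I, which is not reproduced in this section; it is a plain greedy assignment, offered only to show what "mutually exclusive attribute groups" means and why a pair-association record is needed for every descriptor. The document descriptions, the descriptor names "A" through "E" and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pair_associations(documents):
    """Map each descriptor to the set of descriptors it is used with."""
    assoc = defaultdict(set)
    for terms in documents.values():
        for term in terms:
            assoc[term]  # record descriptors even if they never pair up
        for a, b in combinations(sorted(set(terms)), 2):
            assoc[a].add(b)
            assoc[b].add(a)
    return assoc


def greedy_groups(assoc):
    """Put each descriptor in the first group holding none of its associates."""
    groups = []
    # Assumption: the most heavily associated descriptors are placed first.
    for term in sorted(assoc, key=lambda t: (-len(assoc[t]), t)):
        for group in groups:
            if assoc[term].isdisjoint(group):
                group.add(term)
                break
        else:
            groups.append({term})
    return groups


documents = {  # hypothetical document descriptions
    1: ["A", "B", "C"],
    2: ["A", "D"],
    3: ["B", "D"],
    4: ["C", "E"],
}
assoc = pair_associations(documents)
groups = greedy_groups(assoc)
print([sorted(g) for g in groups])
```

A greedy pass of this kind gives no assurance of reaching the minimum number of groups, which is consistent with the observation above that more complex renaming is needed for a minimum array.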

g. Evaluation of the Multi-List System in an IS&R Application.
Evaluating the potential usefulness of the Multi-List System in a document
retrieval application reduces essentially to answering one question: "Is
it practicable to use combinations of several mutually exclusive index terms
as superkeys identifying lists into which the file is organized?" Based
upon the studies and simulations which have been completed on the DDC data,
the answer must be an unqualified "No." As a corollary to substantiating
this conclusion, it is possible also to establish the general data
characteristics which file records must possess if the approach is to have
potential merit.

The notation $S_{i,j} = d_{r,a}\, d_{s,b}\, d_{t,c}$ is used for a superkey, where $c_r$, $c_s$
and $c_t$ can be taken as three consecutive attribute columns and $d_{r,a}$,
$d_{s,b}$ and $d_{t,c}$ are the groups of index terms in each column combined into
the superkey range. For simplicity, the discussion assumes that each column
can contain about the same average number of terms without increasing the
number of columns in the array. Whether or not the assumption is true does
not affect the analysis.

The pair associations of the 599 most common index terms in the DDC
sample require an array of 56 exclusive groups; including all 5540 terms in
the sample may increase this to about 60, or 20 sets of three columns. The
documents in the sample average only 5.14 index terms each; in the latter
part of the chronological period represented, this increases to between
7.5 and 9.1, depending upon security classification. In the entire sample,
only 959 documents have 12 or more terms, the maximum being 21. The
attribute group array, then, has several times as many columns as documents have
descriptors. In fact, for the 97.5% of documents with 11 or fewer describing
terms, there are at least five times as many columns.

Although the distribution of index terms among columns is not random,
it is evident that any one document has terms in only a few of them. In
fact, if it has an index term in the range $d_{r,a}$ in $c_r$, there is a high
probability that it has none at all in $c_s$ and $c_t$. Yet this one term must be
included in a superkey $S_{i,j}$ which, by definition, includes a range of terms
from each of the three columns. Fulfillment of this requirement leads to
the concept of the null index term, $\Lambda_i$, present in every column $c_i$ and
defined as being exclusive to all real terms. In this example, the superkey
then becomes $d_{r,a}\, \Lambda_s\, \Lambda_t$ and it is quite evident that most superkeys will be of
this type, probably well over 90% of them. Some will have two real terms,
with $S_{i,j}$ expressible in the form $d_{r,a}\, d_{s,b}\, \Lambda_t$, and only a small percentage
will have a term from each of the three columns. For practical purposes,
then, the superkeys for most of the index terms in a document will contain
one real and two null terms and be of the form $S_{i,j} = d_{r,a}\, \Lambda_s\, \Lambda_t$. This is
equivalent to establishing a separate list for each index term (or small
range of terms), except that additional bits and file-storage space are
needed to denote the nulls.
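
The effect of the null term can be checked on a single document with a short sketch. It assumes, for illustration only, that columns 1-3, 4-6, and so on form the sets of three consecutive columns, and it represents a real term simply by its column number and a null by `None`. The column list is the one given for document 230049 in Table 4; the function names are hypothetical.

```python
from collections import Counter

COLUMNS_PER_SUPERKEY = 3  # assumption: columns 1-3, 4-6, ... form the sets


def superkeys(columns):
    """Group a document's occupied columns into sets of three (None = null)."""
    keys = {}
    for col in columns:
        set_no, slot = divmod(col - 1, COLUMNS_PER_SUPERKEY)
        keys.setdefault(set_no, [None] * COLUMNS_PER_SUPERKEY)[slot] = col
    return keys


def real_term_profile(columns):
    """Count superkeys by the number of real (non-null) terms they hold."""
    return Counter(
        sum(slot is not None for slot in key)
        for key in superkeys(columns).values()
    )


# Attribute-group columns listed for document 230049 in Table 4.
doc_230049 = [3, 7, 9, 10, 16, 28, 37, 47]
profile = real_term_profile(doc_230049)
print(dict(profile))  # {1: 6, 2: 1}
```

Under these assumptions, eight index terms produce seven superkeys, six of which hold one real term and two nulls.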

As a direct corollary, the lists containing only a single real index
term or range, although relatively few in number (600 if the 60-column array
is broken into ten ranges per column), contain most of the entries
criss-crossing the file storage area, and each list is long. Lists entered by
superkeys of the form $S_{i,j} = d_{r,a}\, d_{s,b}\, \Lambda_t$ or $d_{r,a}\, d_{s,b}\, d_{t,c}$ may be many more in
number, but each is short.

Because index term assignment to columns is not random, it is possible
that some "clustering" tendency may exist in actual usage, especially if the
mutually exclusive columns are reordered. To investigate this possibility,
the high-frequency index terms in 50 consecutive documents in the "9's"
sample were listed by their rank number and column assignment in the array
of Table 1. (Because only those with two or more high-frequency descriptors
are involved, the selection range covered 61 documents.) These are listed
in Table 4, both rank numbers and attribute group columns being given in
ascending order; i.e., there is no column correspondence between the two.
Inspection of the right half of this table clearly shows that almost all
superkeys are of the form $d_{r,a}\, \Lambda_s\, \Lambda_t$. All attempts to improve results by
reordering columns have been fruitless. Because it does not include the
less frequently used index terms, this table is not fully definitive. However,
the dispersion of columns is so great that even their optimum placement for
each document (an impossible expectation) would not significantly alter the
table.

Processing search requests against a file list-organized in this manner
introduces a similar set of considerations. Consider a subset of the search
request in which terms are connected by a logical "and" relationship. In
most cases, each single term will form a superkey of the type $d_{r,a}\, \Lambda_s\, \Lambda_t$.
But it is not sufficient to search only the list created by this superkey;
all lists of which $d_{r,a}$ is one member must be searched. If each column is
divided into ten ranges, this is 100 lists, of which some may have no entries.
As noted above, many of these are short, but nonetheless the records they
include must be examined.
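
The figure of 100 lists follows from simple enumeration, sketched below. The sketch assumes that each of the other two columns offers ten possible entries, as the figure implies; whether the null term counts as one of the ten is not stated in the text. The function name and the range number used in the example are hypothetical.

```python
from itertools import product

RANGES_PER_COLUMN = 10  # the example figure used in the text


def lists_to_search(fixed_range, ranges_per_column=RANGES_PER_COLUMN):
    """Enumerate every superkey (a, b, c) whose first-column range is fixed."""
    others = range(ranges_per_column)
    return [(fixed_range, b, c) for b, c in product(others, others)]


candidates = lists_to_search(fixed_range=4)
print(len(candidates))  # 100
```

With single-term lists, the same request term would require traversing exactly one list.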

It  is  concluded,  therefore,  that  the  grouping  of  index  terms  into 
mutually  exclusive  attribute  columns  and  the  organization  of  the  file  into 
lists  entered  by  multi-column  superkeys  has  no  potential  usefulness  in  a 
document  retrieval  application  and  is,  in  fact,  markedly  inferior  to  a 
straightforward  list-organized  file  in  which  each  term  has  its  own  list. 
Exhaustive  analyses  of  the  large  DDC  sample  show  that  the  typical  superkey 
contains  only  a  single  real  index  term  (or  range  of  terms).  The  Multi-List 
System  does  not  achieve  its  objective  of  minimizing  search  time.  Compared 
to  a  conventional  list-organized  file,  the  Multi-List  System  organization  not 
only  requires  searching  many  more  lists  in  processing  retrieval  requests, 
but also uses considerably more data storage space for retention of the
pair-association lists. Finally, maintenance of the attribute group array imposes
an  additional  processing  workload  conservatively  estimated  to  be  two  or  more 
orders  of  magnitude  greater  than  that  needed  for  single-term  lists. 

It may be argued that the characteristics of the DDC file are not
representative of those in other document retrieval applications and that the
above conclusion is not generally true. To some extent, this may be valid.
We are unaware of any published results of analyses into file characteristics
which have been carried out on the scope or into the depth of those on the
DDC sample. Those data which have been noted are considered to be consistent
with these results. In fact, there is some reason to believe that
more-specialized document files, indexed in greater depth and with a smaller
thesaurus than the DDC library, may show an even more unfavorable
relationship between number of index terms per document and the size of the mutually
exclusive attribute group array.

Table 4

High-Usage Descriptors and Their Mutually Exclusive Attribute
Group Assignments for a Selected Range of 50 Documents

Doc.     No.     Rank Numbers of                        Mutually Exclusive Attribute
Number   Descr.  High-Usage Descriptors*                Groups (Table 1)

230019   6       044 305                                12 30
230029   3       174 304                                13 16
230049   9       003 009 096 178 183 185 213 359        3 7 9 10 16 28 37 47
230059   9       003 004 090 337 352                    3 4 12 19 46
230069   5       417 526 562                            [?] 9 20
230079   10      083 169 176 247 256 422 445            15 17 27 28 30 39 53
230089   6       049 076 595                            3 10 22
230099   5       011 416                                11 43
230119   12      014 036 118 318 536                    3 14 16 18 34
230139   9       001 031 131 164 314 455 498            1 21 24 27 43 47 52
230149   7       003 046 078 083 252 332                3 7 24 31 59 53
230159   8       003 009 031 169 256 358 369            3 9 18 24 28 30 51
230179   5       064 222 384                            30 42 43
230199   8       008 022 386                            8 [?] 32
230219   5       030 143 167 237                        3 21 39 41
230229   3       030 167 578                            21 39 49
230239   12      001 007 017 031 052 059 095 182 528    1 6 [?] 3 12 24 25 40 47
230259   5       003 150                                3 29
230269   8       004 013 094 438                        4 12 14 17
230289   8       037 152 220 240 295 344                3 16 26 32 47 49
230309   6       041 177 192 223                        3 13 24 32
230319   8       002 115 233 281 292 493                2 25 35 43 44 43
230329   8       035 050 139 207 353 459 502            16 26 32 43 44 47 51
230339   4       002 093                                2 29
230349   7       311 339                                4 19
230359   5       001 057 18[?] 592                      1 9 14 17
230379   10      020 021 147 222 490 4[?][?]            19 20 36 40 43 55
230389   5       001 010 087 447                        1 10 23 41
230419   7       009 065 200 391                        9 13 49 51
230439   4       003 009 167                            3 9 39
230459   9       001 004 144 147 249 262 352 544        1 4 11 26 31 36 40 46
230469   9       001 029 091 127 413                    1 14 [?]4 38 [?]
230509   3       173 184 426                            5 40 46
230519   10      002 007 105 239                        2 5 7 31
230539   6       001 013 063 237                        1 17 32 [?]
230549   7       029 232 511                            3 9 14
230579   3       002 004 029 259                        2 4 12 14
2305[?]9 [?]     047 153 195 430 509                    [?] 14 20 34 35
230609   5       001 002                                1 2
230619   15      001 036 039 316 355 429 431 436        1 3 13 26 32 40 [?] 52
230639   30      175 269 433 576                        4 29 45 43
230659   6       009 141 195                            9 34 37
2306[?]9 11      003 004 009 046 094 192 203 247 300    3 4 9 14 15 17 31 52 51
230699   7       076 208 560                            4 22
230729   10      011 076 236                            11 2[?] 45
230749   6       213 404 460 493 546 578                1 23 25 32 46 49
230769   7       027 051 119 193                        5 [?] 19 31
230789   3       052 100 361                            16 17 25
230799   [?]     077 [?]00                              5 49
230809   6       033 074 449 501                        12 15 20 3[?]

* Includes only documents with two or more high-usage descriptors. 11 documents in this
range have none or one.

h. Characteristics of a File Suitable for Multi-List System Organization.
The evaluation implies one basic characteristic a file must possess if the
Multi-List System is to have potential usefulness: There must be a number
of data fields having a range or set of different values as possible entries
and in which all, or nearly all, file records do have an entry. These need
not comprise all the data fields in the record; it can be divided into parts,
one consisting of fields present in practically all records and the other,
the variable or trailer fields which may or may not exist in every record.
The first type can be combined into superkeys; the second, included in
individual lists.

Many types of files have records with this characteristic. In general,
the fixed fields may be further subdivided into two categories. First are
those in which a single record can have only one entry, such as clock number,
base hourly wage rate, home department, number of dependents, etc., in a
personnel file. Each such field may be considered as an attribute, for which
the possible values are by definition mutually exclusive and no procedure is
required to maintain this exclusiveness. Second are those fields in which a
single record may have multiple entries, such as foreign language proficiency
and higher job categories for which a person is qualified. Here exclusiveness
is not an a priori condition, but is a function of the particular entries
which exist in the totality of file records. If such attributes are to be
divided into several (two or more) mutually exclusive groups, then the file
maintenance procedure must provide for retaining a record of existing
associations and adjusting the groups as necessary to reflect changes in the
detail entries.

Attributes with only a single possible entry per record probably are
most susceptible to grouping into superkeys when a list-type file
organization is being evaluated. By attaching a chain address to groups of two or
three data fields (attributes), range broken into superkeys, rather than to
each one, some storage space and input-output transfer time always can be
saved in handling an individual record. Additional savings may accrue by
using condensed codes for the superkeys themselves. Such savings, however,
may be only a fairly small percentage of the record size in a normal
list-organized file.

At least partially offsetting this gain is, usually, somewhat increased
complexity and, possibly, additional computer time both in maintaining the
file and in processing search requests into the lists. This arises because
a search may not (and normally will not) involve all of the attributes grouped
into a superkey, but will be of the form $d_{r,a}\, \Lambda_s\, \Lambda_t$ or $d_{r,a}\, d_{s,b}\, \Lambda_t$, for which
several lists must be searched. Although exactly the same number of records
may be examined with either single-value or superkey list organization, the
latter method requires the extra machine instructions and computer time for
accessing, transferring and examining records in many lists instead of just
one. Evaluation of the relative payoffs, and of the efficiency of organizing
the file itself into lists, must be predicated upon the total uses of the
file and cannot be done here.

If the mutually exclusive attribute groups have to be developed and
maintained in the manner necessary for a document retrieval application, then
it is concluded that the Multi-List System does not constitute a feasible
method of file organization and use. Standard approaches are superior in
terms of storage requirements, of file maintenance complexity and processing
time and of file searching and use. The Multi-List System may result in reduced
storage requirements for files with many data fields of the "attribute" type, but in
most cases will require more processing time for file updating and use.


C. ANALYSIS OF THE BLACK-PATRICK VARIATION OF A DOCUMENT-SEQUENCED FILE

D.  V.  Black  and  R.  L.  Patrick  have  suggested  [9]  a  variation  in  the 
document-sequenced file as a means of realizing greater file-searching
efficiency.  In  this  approach,  the  index  terms  for  each  document  are  ordered 
in  ascending  sequence  (as  one  possibility)  on  their  code  numbers  and  the 
file  records,  document  numbers  and  index  term  codes,  are  ordered  on  the 
string  of  code  numbers  considered  as  a  single  variable-length  key.  Where 
keys  are  identical,  records  are  in  document-number  sequence.  A  file  so 
organized  looks  like  this: 


Doc. No.   Term 1   Term 2   Term 3   Term 4

1000       123
9000       123      234
7000       123      234      345
4000       123      234      345      567
4001       123      234      345      567
4002       123      234      345      567
3053       742      999
0123       846      978      1235
8421       847      1341
9766       954

It will be observed that the records are identical to those in the normal
document-sequenced file, in which index terms usually are carried in
ascending sequence on code number within each document. Only the record sequence
within the file is different.

The index terms in a request (assumed here to have logical "and"
connectives) are converted to code numbers and similarly ordered into
ascending sequence. In processing the request, searching need continue only
through that portion of the main file in which the first terms are equal to
or less than the first term of the request. For example, if a search
includes the terms 234-345-567, the search through the "file" in the table
above terminates after document number 4002. Because the documents beginning
with 3053 include no index term less than (code number) 742, they obviously
cannot meet the search criteria. In the portion of the file in which a "hit"
is possible, each file record is examined by a conventional comparison
subroutine to determine whether or not it meets the criteria.
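
The ordering and the early termination just described can be sketched as follows. This is only an illustration under stated assumptions, not Black and Patrick's program: the function names are hypothetical, document numbers are kept as strings so that "0123" keeps its leading zero, every description is assumed to hold at least one term, and the sample data follow one reading of the printed table.

```python
def build_file(descriptions):
    """Order records on the ascending term-code string, then document number."""
    records = [(tuple(sorted(terms)), doc) for doc, terms in descriptions.items()]
    return sorted(records)


def search(records, request):
    """Return (hits, records examined) for a request of "and"-connected terms."""
    wanted = set(request)
    lowest = min(wanted)
    hits, examined = [], 0
    for key, doc in records:
        if key[0] > lowest:
            break  # every later record starts with a higher code: no hit possible
        examined += 1
        if wanted.issubset(key):
            hits.append(doc)
    return hits, examined


descriptions = {  # sample file from the text
    "1000": [123],
    "9000": [123, 234],
    "7000": [123, 234, 345],
    "4000": [123, 234, 345, 567],
    "4001": [123, 234, 345, 567],
    "4002": [123, 234, 345, 567],
    "3053": [742, 999],
    "0123": [846, 978, 1235],
    "8421": [847, 1341],
    "9766": [954],
}
records = build_file(descriptions)
hits, examined = search(records, [234, 345, 567])
print(hits, examined)  # ['4000', '4001', '4002'] 6
```

For this request the scan stops at document 3053, so six of the ten records are examined. A request whose lowest code is high, such as 999 alone, would still require reading the whole file.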

Does this approach have any significant potential in a document retrieval
application? Unquestionably, it permits terminating a search without
examining all the documents in the file and, from this standpoint, is preferable to
a straight document-sequenced organization. The percentage of the file records
that can be bypassed, on the average, has not been reported. In fact, so far
as known, the proposal has not been tested against an actual file of document
descriptions and a representative sample of search requests.

If documents are ordered on the lowest index term code in their
descriptions, there is an obvious tendency for the file records to be clustered among
the lower code numbers. Further, the probability of having a low code in a
description increases with the number of terms used. Both of these tendencies
are evident in this summary of 50 DDC documents classified by low descriptor
code used. (These are document numbers ending in "9" in the DDC accession
number range 229009-229499, described in 1960. It is not a random sample but
is considered roughly typical of documents accessed during that period.)


Low Descriptor   Number of   Average Descr.   Cum. % of
Code Range       Documents   per Document     Documents

0001-0199             9           8.67            18
0200-0399             8           7.75            34
0400-0599             5           8.40            44
0600-0799             -            -              44
0800-0999             5           9.20            54
1000-1199             -            -              54
1200-1399             8           5.75            70
1400-1599             1           3.00            72
1600-1799             2           5.00            76
1800-1999             1           5.00            78
2000 & Up            11           4.91           100

Total                50           6.92

Only two documents have lowest codes greater than 3000: 3204 (three
descriptors) and 4779 (four descriptors). Thus almost all these documents have at
least one index term in the first 40% of the descriptor code range (maximum
about 7000) and over half of them are in the lowest one-seventh (below 1000).
Because DDC codes are assigned sequentially to descriptor names in
alphabetic order, this clustering tendency in the lower number range is
equivalent to saying that most documents are described with a term whose first
letter is early in the alphabet.
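
The arithmetic of the summary table can be rechecked with a few lines. The first-row average is taken as 8.67, the value that reproduces the printed overall average of 6.92; empty code ranges are entered as zero documents. The variable names are hypothetical.

```python
rows = [  # (low descriptor code range, number of documents, average descriptors)
    ("0001-0199", 9, 8.67),
    ("0200-0399", 8, 7.75),
    ("0400-0599", 5, 8.40),
    ("0600-0799", 0, 0.0),
    ("0800-0999", 5, 9.20),
    ("1000-1199", 0, 0.0),
    ("1200-1399", 8, 5.75),
    ("1400-1599", 1, 3.00),
    ("1600-1799", 2, 5.00),
    ("1800-1999", 1, 5.00),
    ("2000 & up", 11, 4.91),
]

total_docs = sum(n for _, n, _ in rows)

# Cumulative percentage of documents at or below each code range.
cumulative, running = [], 0
for _, n, _ in rows:
    running += n
    cumulative.append(round(100 * running / total_docs))

# Overall average descriptors per document, weighted by document count.
overall = sum(n * avg for _, n, avg in rows) / total_docs

print(total_docs, cumulative, round(overall, 2))
```

The cumulative column shows 54% of the documents with a lowest code below 1000, which is the "over half" cited above.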

The sequenced codes for the terms in a set of average search requests
likewise have a clustering tendency, not necessarily the same as that
exhibited by the library as a whole. The portion of the file that can be
bypassed in processing them cannot be estimated with any accuracy without
conducting an analysis using descriptions of a reasonably large collection
of documents (several thousand, at least) and a representative cross-section
of search requests.

The number of documents examined might approximate, for example, half
the library if the average request meets five conditions: (1) the number
of terms is fairly small; (2) terms have only logical "and" connectives;
(3) retrieval is based upon a full match of all terms and not varying subsets
of those in the request; (4) the average request is described to about the
same degree of detail as the average document; and (5) over a period of
time, the distribution of subject classifications in search requests
approximates that of the document library. In practice, these conditions are not
met and the general effect of the deviations is to increase the portion of
the file which must be searched.

In the conventional document-sequenced file, new documents can be added
at the end with insertions (if any) limited to the latter portions of the
file. In the Black-Patrick variation, insertions are the rule and the entire
file must be rewritten on each updating cycle. To this extent it imposes an
additional processing workload and cost. Although no experimental data have
been seen to support the conclusion, it appears quite possible that the
method is preferable to the standard document-sequenced file, where a saving
of even 10% in the number of records examined may be profitable. However,
it is not considered competitive with either the inverted sequence or
list-organized file in processing search requests. It is applicable only with
magnetic tape or other sequential-access storage medium and, despite the fact
that a list-organized file is twice as large, the latter almost invariably
will result in lower over-all processing time and cost.

D.  OPTIMUM  ORGANIZATION  OF  A  DOCUMENT  RETRIEVAL  FILE 

There seems to be rather general, but not universal, agreement that,
for the foreseeable future, automated document retrieval will be based upon
searching a file in which documents are described by index terms and in which
the request terms are connected by varying complexities of logical "and,"
"or" and "but not" relationships. There also appears to exist rather general
concurrence (possibly not quite so pronounced) that only the inverted
sequence and list-organized files provide really efficient means for automated
retrieval. Certainly only these two can be considered in a real-time
operation, which demands an on-line, mass-storage (random access) device for the
document file.


1. General Comments on Factors Affecting File Organization.

The most efficient detailed form of file organization is predicated to
some extent upon characteristics of the data processor and its storage
devices. For example, if a disc file or drum always transfers blocks of 100
characters, nothing can be done about it (without changing the equipment) and
the detailed file design and use specifications must take this fact of life
into account. Insofar as internal processing and data storage capabilities
are concerned, practically all modern (current decade) general-purpose EDPM's
are quite flexible and pose no basic restrictions on the type of file
organization established. A real-time retrieval operation, and particularly one
in which a person is permitted to "browse" through the automated file, requires
some type of query (data input) and display (data output) device connected to
the processor. Here the limitations are much more apt to be those of the
capabilities (and cost) of the device rather than those of the rest of the
processing system. Because of these equipment-related factors, a detailed
file layout can be made and optimized only within the framework of the
characteristics of a specific equipment configuration.

The most efficient general form of file organization, however, depends
largely upon the requirements the file processing must meet and the
environment in which the operation is performed. Consequently, it can be studied
and conclusions can be reached. This is true despite the fact that
requirements and environments are quite diverse and, at first glance, it might seem
that the optimum file organization takes many forms, depending upon the
particular conditions applicable. The problem can be reduced to manageable
size by eliminating those phases or requirements which are not a direct part
of maintaining the index file to be searched or of processing requests
against it.

As examples, the procedures for maintaining and using an automated
thesaurus are essentially identical for both list-organized and inverted
files. The method of arriving at index terms (manual or machine) and of
validating them against the thesaurus is a function independent of the
organization of the index file. The accumulation of statistical data can
be done in about the same way with either type of file. A similarly separate
processing function is that of maintaining auxiliary files which may be
required, such as those used to develop significant usage associations of
index terms. Selection of documents for "current awareness" programs occurs
at the time new descriptions are entered for processing and also is
independent of the particular file format in which the data are to be stored for
subsequent searches.

2.  Advantages  and  Disadvantages  of  Inverted  and  List-Organized  Files. 

The organization, content and use of the index term file are predicated
upon the requirements of the search algorithm and the exact nature of the
output. Both the inverted and list-organized files contain only document
numbers and index terms. The output of a search through either type of file,
then, is limited to these two types of data. Inclusion in the output of such
additional information as titles, abstracts or copies of documents is not
possible using only these files, but requires one or more additional
operations. These are not part of the direct file searching process and may or
may not be automated.




a.  Differences  in  Search  Outputs.  The  first  basic  difference  in  the 
use  of  these  files  is  the  nature  of  the  output.  For  practical  purposes, 
the  output  of  searching  an  inverted  file  is  a  list  of  each  document  number 
satisfying  the  search  criteria  plus,  if  desired,  the  list  of  index  terms 
upon which the selection was based. The list-organized file can produce
not only the document list but also all index terms used in each description.
In  addition,  by  expanding  the  size  of  the  file  record,  such  other  data  as 
author's  name,  publication  or  journal,  date  of  publication,  etc.,  can  be 
incorporated  in  the  output.  This  is  possible  whether  or  not  such  fields 
are  used  in  the  same  manner  as  "normal"  index  terms. 
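
The difference in output can be made concrete with a small sketch of both organizations. It rests on several assumptions: the document numbers and term codes are hypothetical, the list-organized file is modelled as document records in which each term carries a link to the next record containing that term, and the search simply follows the chain of the first request term rather than choosing the shortest list.

```python
from collections import defaultdict

descriptions = {  # hypothetical document numbers and index term codes
    101: [12, 40, 77],
    102: [12, 77],
    103: [40, 77, 90],
}

# Inverted file: one entry per index term, holding document numbers only.
inverted = defaultdict(set)
for doc, terms in descriptions.items():
    for term in terms:
        inverted[term].add(doc)


def search_inverted(request):
    """Intersect the entries for the requested terms; only numbers come back."""
    return sorted(set.intersection(*(inverted[t] for t in request)))


# List-organized file: document records, each term linked to the next
# record containing the same term (None ends the list).
records, heads = {}, {}
for doc in sorted(descriptions, reverse=True):
    links = {}
    for term in descriptions[doc]:
        links[term] = heads.get(term)  # the previous head becomes the link
        heads[term] = doc
    records[doc] = links


def search_list_organized(request):
    """Follow one term's list and test each record against the full request."""
    hits = []
    doc = heads.get(request[0])
    while doc is not None:
        links = records[doc]
        if all(term in links for term in request):
            hits.append((doc, sorted(links)))  # whole description is at hand
        doc = links[request[0]]
    return hits


print(search_inverted([12, 77]))        # [101, 102]
print(search_list_organized([12, 77]))  # [(101, [12, 40, 77]), (102, [12, 77])]
```

Both searches select the same documents, but only the record-based organization has the complete description of each hit available without a further lookup.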

The  greater  output  flexibility  of  the  list-organized  file  points  out 
another  essential  difference  between  the  two  types.  The  fact  that  it  is 
based  upon  a  document  record  which  can  be  expanded  rather  easily  to  include 
more data than the basic indexing terms themselves is a strong incentive to
do  just  that.  Consequently,  the  evaluation  of  which  type  of  file  is  most 
efficient  usually  will  not  be  based  upon  two  different  organizations  of  the 
same  data  base.  Almost  inevitably,  the  list-organized  file  will  contain 
more  information  than  the  inverted  file. 
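
As a hedged illustration of the output difference just described, the following Python sketch contrasts the two organizations. The sample documents, field names and function names are hypothetical and are not taken from the report; the simple scan in `record_output` merely stands in for whatever access method a document-record file provides.

```python
# Hypothetical sample data: document number -> index terms plus extra fields.
DOCS = {
    101: {"terms": ["radar", "antenna"], "author": "Smith", "year": 1961},
    102: {"terms": ["radar", "signal"], "author": "Jones", "year": 1962},
    103: {"terms": ["antenna"], "author": "Smith", "year": 1963},
}


def build_inverted(docs):
    """Inverted file: one record per index term holding ascending document numbers."""
    inverted = {}
    for number in sorted(docs):
        for term in docs[number]["terms"]:
            inverted.setdefault(term, []).append(number)
    return inverted


def inverted_output(inverted, term):
    """Only document numbers (plus the term searched on) are available."""
    return [(number, term) for number in inverted.get(term, [])]


def record_output(docs, term):
    """A file of document records can return every stored field of each hit."""
    return [
        (number, doc["terms"], doc["author"], doc["year"])
        for number, doc in sorted(docs.items())
        if term in doc["terms"]
    ]
```

The inverted file can report which documents matched and on which term, but author and year are simply not present in it; the document-record form returns them with no further lookup.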

If output requirements are satisfied by a list of document numbers
(plus, at most, the descriptors upon which the selection is based), then
either type of file organization can be used. If additional descriptive
information of the general types mentioned above is postulated, then only
the list-organized file is applicable.

b.  Differences in Nature of Search Algorithm. The list-organized file
is more flexible than the inverted file in the degree of sophistication or
complexity permissible in the search algorithm. The list-organized file
can be used for any type of search possible against an inverted file. In
addition, it permits search criteria which are not practicable with the
latter form of file organization. The greater capabilities of the list-organized
file arise because, in processing a request, it makes available
more data than does the inverted file.

The relative degrees of search complexity may be summarized in this
manner: With an inverted file, all index terms used in the selection must
be contained in the basic search request, or must be derivable from sources
other than the file itself. As examples of the latter, the input terms may
be expanded based upon hierarchal or structural relationships carried in an
(automated) thesaurus, or upon usage association data contained in the
thesaurus or other file which can be accessed with an index term as key.
In addition to the above, the list-organized file makes it possible to incorporate
criteria based upon terms contained in document records accessed
through the initially given terms. The additional terms so obtained are
derivable only from within the list-organized file itself.

The applicability of the two files to some of the more commonly proposed
search parameters is discussed briefly:

Both can handle the same complexity of logical relationships between
search terms; typically limited to "and," "or" and "but not" connectives.

Both have the same capabilities for converting between external and
internal language: Term names to index term codes, non-indexing names
or codes to indexing codes, external index codes to internal codes, etc.

Both  files  can  handle  requests  when  all  terms  in  the  search  criteria 
are  included  in  the  request  input. 

Both  files  can  be  used  when  the  selection  criteria  can  include  subsets 
of  the  full  range  of  index  terms  (e.g.,  selection  of  all  documents 
containing  any  three  of  five  given  index  terms).  With  both  files, 
weight  factors  can  be  used  and  calculated  document  weight  factors  can 
be part of the output. Also, the output can include the number of
terms upon which selection was made, or a list of the terms, or both.

Both  files  can  be  used  if  the  basic  index  terms  of  the  request  are  to 
be  expanded  based  upon  term  relationships  included  in  the  thesaurus, 
with  or  without  weight  factors  assigned  to  the  additional  terms  so 
generated. 

Similarly, both can be used with expansion of the list of terms based
upon "significant" associations of terms occurring in the file as a
whole. Pairs, triplets, or larger numbers of terms may be used in the
determination  of  association  factors. 

In the above two cases, both files permit limiting selection of documents
to those meeting specified conditions of given and added index
terms. 

The list-organized, but not the inverted, file permits additional search
cycles  using  new  index  terms  included  in  documents  selected  during  the 
previous  cycle.  Here  it  is  understood  that  the  new  terms  are  found 
solely  because  of  their  inclusion  in  documents  selected  on  the  basis 
of  already-known  terms.  They  are  not  derivable  from  the  thesaurus. 

The  new  terms  can  be  weighted  and  combined  in  these  subsequent  search 
cycles  in  the  same  manner  as  the  original  terms. 
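
The additional search cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's program: the "any term" selection rule, the `min_count` threshold for accepting a new term, and all names are hypothetical. The essential point is that the new terms come only from the records of documents already selected, which is why the document records must be available.

```python
from collections import Counter


def search_any(docs, terms):
    """Return numbers of documents containing at least one of the given terms."""
    wanted = set(terms)
    return sorted(n for n, doc_terms in docs.items() if wanted & set(doc_terms))


def expand_terms(docs, selected, known_terms, min_count=2):
    """Collect new terms occurring in at least min_count of the selected documents."""
    counts = Counter(
        term
        for number in selected
        for term in set(docs[number])
        if term not in known_terms
    )
    return sorted(term for term, count in counts.items() if count >= min_count)


def two_cycle_search(docs, terms):
    """First cycle on the given terms, second cycle on given plus discovered terms."""
    first = search_any(docs, terms)
    new_terms = expand_terms(docs, first, set(terms))
    second = search_any(docs, list(terms) + new_terms)
    return first, new_terms, second
```

With an inverted file alone, `expand_terms` has nothing to read: the term records list document numbers, but the terms of a selected document are not recoverable without scanning every term record.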

The following search criteria involve data other than what are generally understood
to  be  "index  terms,"  but  which  may  be  incorporated  into  the  search  file. 

Dates  (year  of  publication,  for  example)  can  be  a  search  criterion  with 
both files. In the list-organized file, the date can be included in each document
record. If so, it can be part of the search output whether or not
it  is  used  as  a  selection  criterion.  In  the  inverted  file,  each  time 
interval  is  set  up  as  an  index  term  record  containing  the  numbers  of 
all  documents  applicable.  (This  record  almost  always  has  thousands  of 
detail  entries.)  Dates  of  selected  documents  cannot  be  provided,  at 
least  practicably,  unless  specified  as  a  search  parameter. 

Author's name, with an inverted file, can be used only if it is an index
term in the basic request and, further, only if the basic file has a
record, for each author, with the list of pertinent document numbers.
For practical purposes, it is not possible to determine the author of a
document selected on the basis of other terms, even though the file
includes the above record for each author.

Author's name, with a list-organized file, can be included readily as
part of the output of all searches, provided only that it is a data
field in each document record. In addition, the documents for each
author can be "chained" into a list accessible through an enlarged
entry table. If this is done, the search output can be expanded to
include all other documents by the authors of those selected during
the basic search.

Journal or publication name (usually coded), with a list-organized file,
can  be  included  as  part  of  the  search  output  in  the  same  manner  as 
author’s  name.  Although  this  field  also  can  be  placed  into  lists, 
there  is  considered  to  be  practically  no  advantage  in  doing  so.  With 
an  inverted  file,  this  field  is  subject  to  the  same  restrictions  as 
author's  name  and,  in  practice,  cannot  be  used. 

Role  indicators  for  index  terms  can  be  used  with  both  files.  Separate 
records  (inverted  file)  or  lists  can  be  set  up  for  each  role-term 
combination; or, alternatively, a single record or list can be established
for the term, with modifiers associated with the document number
specifying the applicable role.

Link indicators definitely can be used with a list-organized file.
Their use introduces several complexities with an inverted file, and
it  is  not  known  if  they  can  be  incorporated  efficiently.  There  is  a 
good  deal  of  controversy  on  the  usefulness  of  link  indicators  in  a 
document  retrieval  application.  Analyses  of  their  effects  on  file 
organization  are  not  considered  warranted  at  this  time. 

c.  File  Maintenance  Differences.  Updating  a  list-organized  file 
requires more computing than an inverted file. The additional operations
are those necessary to create the chain address for every index term in each
new document. Inverted file updating is straightforward and simple: Create
word-pairs for each new index term and document number combination, sort
into (term) sequence and merge the document numbers into the existing term
records.  With  serially  assigned  accession  numbers,  the  merging  occurs  only 
at  the  end  of  each  record  to  be  updated;  ideally,  new  numbers  are  added  only 
at  the  end  of  the  record.  In  general,  the  complete  record  for  each  index 
term  in  the  new  documents  is  read  and  completely  rewritten.  The  operations 
are  organized  most  efficiently  in  a  sequential  manner  and  even  the  sorting 
requires  relatively  little  computer  memory. 
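
The three updating steps named above (create pairs, sort into term sequence, merge) can be sketched in Python. This is an illustrative sketch only: the in-memory dictionary stands in for the term records on the storage medium, and the function name and the returned count of rewritten records are assumptions made for the example.

```python
def update_inverted(inverted, new_docs):
    """Merge new documents into an inverted file (term -> ascending doc numbers).

    Returns the number of term records that had to be read and rewritten.
    """
    # Step 1: one (term, document number) pair per index term of each new document.
    pairs = [
        (term, number)
        for number, terms in new_docs.items()
        for term in set(terms)
    ]
    # Step 2: sort into term sequence; document number breaks ties.
    pairs.sort()
    # Step 3: merge the document numbers into the existing term records.
    rewritten = set()
    for term, number in pairs:
        record = inverted.setdefault(term, [])
        if record and number <= record[-1]:
            # With serially assigned accession numbers this never happens,
            # so every addition lands at the end of the record.
            raise ValueError("accession numbers must be assigned serially")
        record.append(number)
        rewritten.add(term)
    return len(rewritten)
```

Because the pairs arrive in term sequence, each term record is touched once per updating cycle, which is what makes a purely sequential pass over the file sufficient.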

The most efficient algorithm for updating a list-organized file requires
that the entire index term entry table be in the processor memory. If this
is done, the chain addresses for each document can be created one after the
other, the entry table being updated simultaneously, and the document transferred
to the file storage medium before processing the next one. This
approach uses a quite large amount of memory (two words, or about ten characters,
for each index term in the thesaurus) and in many cases may not be
practicable.  Alternative  methods  take  more  computing  time. 
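
A minimal sketch of this updating method follows, assuming (these details are not given in the report) that each entry-table item holds the address of the most recent record on the list plus the list length, and that each new record chains to the previous head of every list it joins. A Python list stands in for the file storage medium, with the list index serving as the record address.

```python
def add_document(file_records, entry_table, number, terms):
    """Append one document record, chaining each term to the previous list head."""
    address = len(file_records)  # storage address the new record will occupy
    chained = []
    for term in sorted(set(terms)):
        head, length = entry_table.get(term, (None, 0))
        chained.append((term, head))  # chain address: prior record with this term
        entry_table[term] = (address, length + 1)  # entry table updated as we go
    file_records.append({"number": number, "terms": chained})
    return address


def follow_list(file_records, entry_table, term):
    """Traverse one list from its entry-table head; return the document numbers."""
    numbers = []
    address, _ = entry_table.get(term, (None, 0))
    while address is not None:
        record = file_records[address]
        numbers.append(record["number"])
        address = dict(record["terms"])[term]
    return numbers
```

Under these assumptions no existing record is rewritten when a document is added; only the in-memory entry table changes, which is why the whole table needs to be resident during the update.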

In the typical case of an updating cycle with about 500 new document
entries, fewer file references are needed with a list-organized file. Although
the entire entry table is read and rewritten, it is small compared to
the document file itself. With a mass storage device, one access is required
for each document record processed; two may be needed. With magnetic tape
storage, the file always can be organized so that new documents are added at
each end (i.e., the file need not be in document number sequence) without
rewriting the previously existing file. With an inverted file, a record
access  is  necessary  for  each  index  term  included  in  the  input.  For  typical 
updatings  with  small  document  volumes,  there  usually  are  several  times  as 
many  terms  as  documents.  An  inverted  file  requires  more  accesses  to  update 
the  index  file  than  does  a  list-organized  file. 
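
The access counts compared above reduce to simple arithmetic, sketched here under simplifying assumptions: one access per new document record for the list-organized file (ignoring the entry-table read and rewrite and the possible second access), and one access per distinct index term in the input for the inverted file. The function name and sample batch are hypothetical.

```python
def update_accesses(new_docs):
    """Count record accesses needed to add a batch of documents to each file type."""
    distinct_terms = set()
    for terms in new_docs.values():
        distinct_terms.update(terms)
    return {
        "list_organized": len(new_docs),   # one record written per document
        "inverted": len(distinct_terms),   # one term record rewritten per term
    }
```

For a small batch in which few terms repeat, the inverted count approaches the total number of term assignments, which is several times the number of documents.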

Periodic  file  purging  (elimination  of  documents)  is  somewhat  faster  with 
a  list-organized  file  than  with  an  inverted  file,  provided  that  the  purging 
involves a solid block of the oldest documents in the file. This is done so
seldom (once or twice a year) that it is not an important factor in the
selection of a file design. However, random purging is not only possible,
but  simple,  with  a  list-organized  file.  The  storage  space  occupied  by  the 
record  cannot  be  eliminated  because  of  the  need  for  retaining  the  chain 
addresses,  but  the  document  effectively  can  be  "killed"  by  flagging  or 
wiping out its number. Random purging can be done, but is not practicable,
with an inverted file.
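
Random purging by flagging can be sketched as follows. The record layout (a `number` field plus a `chain` dictionary of per-term chain addresses) and the use of `None` as the "wiped out" number are assumptions made for illustration only.

```python
def purge(file_records, address):
    """'Kill' a record by wiping its document number; chain addresses are kept."""
    file_records[address]["number"] = None


def live_numbers(file_records, head, term):
    """Follow one term's chain from the given head, skipping killed records."""
    numbers = []
    address = head
    while address is not None:
        record = file_records[address]
        if record["number"] is not None:
            numbers.append(record["number"])
        address = record["chain"][term]  # the chain survives the purge
    return numbers
```

The purged record still occupies its storage space and is still accessed during traversal, but it can no longer be selected.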

d.  File  Storage  Comparison.  The  exact  method  of  setting  up  file  records 
on  the  storage  medium  depends  heavily  upon  the  specifications  of  the  storage 
device  itself  and  the  nature  of  data  transfers  to  and  from  the  central 
processor.  It  almost  never  is  possible  to  optimize  all  of  the  several  factors 
involved.  Among  the  more  important  are:  (1)  Utilization  of  the  data  storage 
space  available,  particularly  with  mass  storage  devices;  (2)  effective, 
rather  than  instantaneous,  transfer  rates  to  and  from  the  computer  memory, 
especially  with  sequential-access  storage;  (3)  access  time,  either  sequential 
or random; (4) the amount of memory required for input/output data transfers
in  relation  to  the  total  available;  and  (5)  the  effects  of  file  design  on 
processing  time.  In  practice,  the  detail  file  design  is  a  compromise,  each 
of  several  conflicting  objectives  being  achieved  in  varying  degrees  (and, 
usually,  none  being  fully  realized). 

In  this  respect,  it  should  be  noted  that  the  degree  of  compromise 
necessary  varies  considerably  for  different  types  of  approaches  to  basic  file 
organization. With current mass storage devices, for example, it is considered
much more difficult, if not impossible, to set up a list-organized
file which will come as close to realizing its potential advantages as will
an inverted file on the same device. A basic file organization which in
theory may be superior or preferable to another may in practice be inferior
or less efficient.

Because of the varying characteristics of storage devices and their
interfaces with the rest of the processing system, it is appropriate to make
only general remarks and comparisons on the implications of the medium
selected for the list-organized and inverted files.

If the index file is stored on magnetic tape or similar sequential-access
devices, comparable efficiencies can be achieved with either type of
organization. Tape blocks almost invariably are fairly long to attain high
effective transfer rates and, with modern equipment, range from 500 characters
up; larger blocks are desirable if enough memory can be allocated for input-output
areas. Thus with both files, a number of records are packed into one
block. With an inverted file, the long records for common terms may be split
into several blocks. The condition probably never arises with a list-organized
file; 50 index terms for a document (the largest number reported)
creates a record on the order of 500 characters.

Two  points  may  be  noted.  First,  the  list-organized  file  is  about  twice 
as  large  as  the  inverted  and  takes  twice  as  long  to  process.  Thus,  if  search 
criteria are within the scope of what it can handle, the inverted file is
preferable when sequential access storage is used. Second, if a list-organized
file is used, chain addresses must carry only in the forward direction of the
tape. In practice, this results in mixing records of various sizes within
the file. Unless the equipment includes a flexible input-output control word
system (e.g., "scatter read"), time to search out individual records increases.
Records  in  an  inverted  file  can  be  grouped  quite  easily  according  to  length 
(number  of  index  terms  included). 

Three  important  characteristics  affect  the  organization  of  a  file  on  a 
mass  storage  device,  such  as  a  disc  or  drum.  First,  the  random  access 
capability  requires  specifying  a  record’s  location  as  a  machine-fixed 
address — disc  surface,  track,  and  sector  within  track,  for  example.  This 
factor causes no logical difficulty with either list-organized or inverted
files; the machine addresses need not be the same as document numbers or
index term codes. In a list-organized file, however, their use as chain
addresses almost certainly increases the size of each record. This follows
because the document number, which is what really is being chained, seldom
exceeds six decimal digits, or 20-24 bits, while machine addresses of mass
storage sectors usually take more bits than this.
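
The "20-24 bits" figure for a six-digit document number can be checked with a short calculation. The two encodings shown (pure binary, and four bits per decimal digit) are assumed here as the likely sources of the two ends of the range; the report does not state them.

```python
def binary_bits(decimal_digits):
    """Bits needed to hold the largest number of the given decimal length in binary."""
    return (10 ** decimal_digits - 1).bit_length()


def coded_decimal_bits(decimal_digits):
    """Bits needed when each decimal digit is stored as a separate 4-bit code."""
    return 4 * decimal_digits
```

For six digits the two functions give 20 and 24 bits respectively, so any machine address wider than this enlarges every chain field in every record.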

Second,  in  many  equipments,  sectors  have  a  fixed  character  capacity, 
usually  in  the  60-200  range,  but  sometimes  larger.  Data  transfers  may  occur 
in one or more of three basic ways: (1) One complete sector at a time;
(2) one sector, with the transfer terminated when the actual end of data is
reached; and (3) multiple sectors, variable in number, transferred at a time.
With both types of files, compromises are necessary to fit the variable-length
records into fixed-length sectors and to handle long records which
cannot be contained within a single sector. A few equipments provide for
truly variable-length sectors, one sector terminating and the next beginning
immediately after the end of each record. Thus one track can have a variable
number of sectors, each of different length, track capacity setting the maximum
sector size. This facility is well-adapted for files in which records
are variable in length but, once established, are essentially static, i.e.,
do not expand or contract during subsequent processing. This is a basic
characteristic  of  a  document  description  and  thus  a  list-organized  file  can 
readily  utilize  variable-sector  storage. 

Third,  mass  storage  devices  are  relatively  more  expensive,  per  bit,  than 
magnetic  tapes  and  in  operation  the  entire  file  must  be  available  to  the 
central  processor.  Thus  it  is  desirable  to  utilize  a  high  percentage  of  the 
available  bit  capacity  for  data  storage.  This  may  be  difficult  to  achieve 
with  fixed-length  sectors  and  a  list-organized  file,  where  the  maximum  record 
may  be  4-10  times  as  long  as  the  minimum.  Here  it  is  doubtful  if  utilization 
of  as  much  as  70%  can  be  realized  without  sacrificing  some  of  the  potential 
advantages  of  this  method  of  file  organization.  With  variable-length  sectors 
and one record per sector, the utilization may be somewhat better. Some
track capacity is used to record the machine sector addresses and other
hardware signals associated with variable-length data blocks. Normally, this
is equivalent to many bits and, for the short records typical of document
descriptions, may take 15% or more of the capacity potentially usable. In
addition, the machine addresses tend to be fairly long; if used as the chain
addresses  within  each  record,  their  greater  length  (than  document  numbers) 
further  reduces  the  effective  data  storage  capacity. 
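
The utilization argument can be made concrete with a small calculation. This is a sketch under simplifying assumptions that are not in the report: with fixed-length sectors each record starts on a sector boundary and occupies whole sectors, and with variable-length sectors each record carries a constant per-sector overhead expressed in characters. The record lengths used with it are hypothetical.

```python
def fixed_sector_utilization(record_lengths, sector_size):
    """Fraction of allocated capacity holding data when records fill whole sectors."""
    sectors = sum(-(-length // sector_size) for length in record_lengths)  # ceiling
    return sum(record_lengths) / (sectors * sector_size)


def variable_sector_utilization(record_lengths, overhead_per_sector):
    """Fraction holding data when every record pays a fixed addressing overhead."""
    data = sum(record_lengths)
    return data / (data + overhead_per_sector * len(record_lengths))
```

The shorter the records are relative to the sector size or to the per-sector overhead, the lower both figures fall, which is the situation with short document descriptions.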

These  factors  are  not  so  important  with  an  inverted  file,  whose  records 
increase  in  size  with  time  and  whose  growth  factor  is  taken  into  account  in 
file  design  and  storage  allocation.  Internal  index  term  codes  easily  can  be 
made  the  same  as  machine  sector  addresses  and  term  records  of  like  sizes  can 
be grouped readily to utilize most of the capacity of fixed-length sectors.
If variable-length sectors are used, the machine addresses take a much smaller
percentage of track capacity because the average index term record is much
longer than the average document description (a 7:1 ratio in the DDC sample
and this probably is lower than in the typical document retrieval application).


e.  Comparison of Search-Request Processing Requirements. Four factors
affecting the processing of search requests may be noted: (1) Number of
records accessed or acted upon; (2) amount of data transferred into the
processor memory; (3) amount of computing necessary to determine the documents
meeting the search criteria; and (4) the amount of memory required to hold
data and the program.

It  has  been  noted  that,  with  an  inverted  file,  one  record  is  accessed 
for  each  index  term  in  the  request,  some  of  them  being  very  long.  Their 
number  seldom  exceeds  20.  With  a  list-organized  file,  the  number  of  accesses 
is  highly  variable,  but  the  individual  records  are  short.  The  ideal  search 
here  is  one  in  which  the  request  contains  an  infrequently  used  term  connected 
by a logical "and" relationship to all its other terms. Then only the documents
in this one short list need be accessed. The case is not considered
typical.  The  common  term  may  not  be  infrequently  used.  The  request  may  not 
be  simple,  but  contain  two  or  more  subsets,  each  with  one  term  having  the 
desired  "and"  relationships.  Or  the  selection  criterion  may  be  based  upon 
partial  matching  against  terms  in  the  request.  The  "average"  search  against 
a  list-organized  file,  then,  requires  traversing  several  lists  and,  although 
shortest  lists  can  be  selected  whenever  possible,  the  total  number  of  records 
accessed  is  fairly  large  and  several  times  as  many  as  with  the  inverted  file. 
It  may  also  be  noted  that  a  variable  percentage  of  records  will  be  accessed 
and  processed  two  or  more  times  because  they  belong  to  more  than  one  of  the 
lists involved. Consequently, total access time (in the 15-75 millisecond
range for typical mass storage devices) normally is several times as long
with  a  list-organized  as  with  an  inverted  file.  This  is  an  important  design 
consideration  for  a  real-time  document  retrieval  application. 

The amount of data transferred into the processor is the product of the
number of records accessed and their average length. In a list-organized
file, the average length of records examined is about the same as that of
the file as a whole. This is not true of an inverted file. An examination
of a number of requests and some published data on this aspect indicate that
the average length (number of documents) of search terms is considerably
larger than that of the index terms in the file as a whole. This is
tantamount to saying that search requests typically contain several rather
common terms. (In a list-organized file, this means that the average length
of the lists in a request is greater than that of the total file.) No
definitive  data  have  been  obtained  as  to  which  type  of  file  organization 
results  in  the  transfer  of  the  lesser  amount  of  data.  However,  an  answer  to 
this  question  may  not  be  of  major  importance.  With  most  current  equipments, 
data  transfers  occur  at  very  high  speeds.  With  mass  storage  devices,  access 
time  for  a  record  greatly  exceeds  the  actual  transfer  time  of  all  except 
extremely  large  blocks  of  data. 

Except  for  control  and  input-output  programming,  the  computing  time 
necessary  to  process  a  search  request  is  largely  a  function  of  the  number  of 
comparisons made. This is easily determinable with an inverted file in which
the  comparisons  are  made  against  the  sequenced  list  of  document  numbers  in 
the  record  for  each  index  term  and  a  similarly  ordered  list  of  document 
numbers  meeting  the  search  criteria  to  the  current  point  of  processing.  The 
number  of  comparisons  effectively  is  the  same  as  the  total  of  the  document 
numbers  read  in  with  all  index  terms  in  the  request  and  is  independent  of 
the order in which the terms are processed. (Actually, it is a little less,
because  the  two  lists  usually  are  not  exhausted  simultaneously.) 
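
The comparison process described above, for a request whose terms are all joined by "and," can be sketched as a merge of two ascending lists with a comparison counter. The function names and sample term records are hypothetical; the sketch only illustrates that each comparison advances at least one list, so the count is bounded by the number of document numbers read in.

```python
def intersect_counting(current, record):
    """Merge-intersect two ascending lists; return (matches, comparisons made)."""
    i = j = comparisons = 0
    matches = []
    while i < len(current) and j < len(record):
        comparisons += 1
        if current[i] == record[j]:
            matches.append(current[i])
            i += 1
            j += 1
        elif current[i] < record[j]:
            i += 1
        else:
            j += 1
    return matches, comparisons


def and_search(inverted, terms):
    """Intersect the term records in the given order; return (hits, comparisons)."""
    result = list(inverted[terms[0]])
    total = 0
    for term in terms[1:]:
        result, count = intersect_counting(result, inverted[term])
        total += count
    return result, total
```

The loop stops as soon as either list is exhausted, which is why the count falls a little short of the total number of entries read.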

With a list-organized file, the number of comparisons is not easily
predictable. All pertinent index terms in the request must be examined and
pass the search criteria to accept a document. It is rejected at the first
failure to pass a selection criterion and this occurs after examining a
variable number of index terms. No reports of analyses into this phase have
been seen. Second, and more important, the number of comparisons is highly
dependent upon the order in which the index terms are processed. Within
each document record, the terms are in some prescribed order, which without
loss of generality can be assumed to be ascending sequence on index term
code. Unless the terms in the request can be taken in the same sequence,
the record may be scanned several times to find individual terms, each
scanning involving several comparisons. It is considered probable that the
request terms can be so ordered, but the comparison subroutines probably
are  longer,  and  take  more  computing  time,  than  the  straightforward  "accept- 
reject"  possible  with  an  inverted  file. 
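
A hedged sketch of this processing follows, for a request whose terms are all joined by "and." It assumes integer index term codes held in ascending order in each record, an entry table giving (head address, list length) per term, and chain addresses stored beside each term; none of these layout details is specified in the report.

```python
def record_has_all(record_codes, request_codes):
    """One forward scan of a record; both sequences ascending.

    Returns (accepted, comparisons); the record is rejected at the first failure.
    """
    comparisons = 0
    position = 0
    for wanted in request_codes:
        while position < len(record_codes):
            comparisons += 1
            if record_codes[position] >= wanted:
                break
            position += 1
        if position == len(record_codes) or record_codes[position] != wanted:
            return False, comparisons
        position += 1
    return True, comparisons


def list_and_search(file_records, entry_table, request_codes):
    """Traverse the shortest list in the request; return (hits, records accessed)."""
    request = sorted(request_codes)
    shortest = min(request, key=lambda code: entry_table[code][1])
    address = entry_table[shortest][0]
    hits, accesses = [], 0
    while address is not None:
        record = file_records[address]
        accesses += 1
        codes = [code for code, _ in record["terms"]]
        accepted, _ = record_has_all(codes, request)
        if accepted:
            hits.append(record["number"])
        address = dict(record["terms"])[shortest]
    return hits, accesses
```

Because the request is sorted into the same sequence as the record, each record is scanned at most once; the comparison count still varies from record to record, which is the unpredictability noted above.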

The  program  for  processing  search  requests  against  an  inverted  file 
appears  to  be  less  complex  than  that  for  a  list-organized  file  and  thus  to 
require  a  somewhat  smaller  amount  of  computer  memory.  However,  the  inverted 
file needs much more memory for data storage. If list organized, each document
record is accepted or rejected on the spot and no intermediate data are
carried over from one to the next. If inverted, an intermediate list of
document numbers is carried over to each successive index term and memory
must be allocated to hold it. This list may be fairly long (several hundred
documents at some stages of the processing) or there may be more than one of
them, depending upon the logical complexity of the request and the order in
which the terms are processed. In addition, with a mass storage device, its
data input area is large, because it is necessary (or at least highly
desirable) to provide for reading successive blocks of several hundred words
each  for  index  terms  appearing  in  many  documents.  On  the  other  hand,  the 
input  area  with  a  list-organized  file  only  need  be  large  enough  to  handle 
the longest document description. If magnetic tape is used, the blocks are
about the same size with either file and the input areas therefore are
comparable.

With batch processing of search requests against an inverted file on
magnetic tape, intermediate data storage requirements often are so large
that the processing of the main file is limited to writing out a "work tape"
of the records for the index terms involved. Subsequently each request is
processed, one after the other, against this small "work tape." Batched
searching against document-sequenced or list-organized files can be done as
each successive record is read in, although the latter type of organization
may introduce a rather complex control program to handle the multiple lists
being followed. Use of mass storage devices eliminates this type of batch
processing; each request is acted upon individually even if several are
received at one time.


3.  Determination of Optimum File Organization for Document Retrieval.

From  the  discussion  of  the  previous  two  sections,  it  is  considered  that 
the  inverted  file  is  the  more  efficient  organization  if  the  types  of  searches 
it can accept and the output data it provides meet the application requirements.
This is true for both sequential and random access file storage.

The  file  is  smaller  than  any  other  except  the  straight  document-sequenced 
organization;  is  simple  to  maintain;  requires  fewer  record  accesses  in 
processing  a  search  request;  probably  selects  documents  with  considerably 
less  internal  computing;  and  is  susceptible  to  efficient  operation  with 
either  sequential  or  random  access  types  of  file  storage. 

The  basic  disadvantages  of  the  inverted  file  relate  to  the  scope  or 
complexity  of  search  criteria  which  are  permissible  and  to  its  restricted 
output  in  response  to  search  requests.  Although  it  may  be  granted  that  the 
inverted  organization  adequately  meets  the  requirements  of  many,  if  not  most, 
existing  document  retrieval  applications,  there  appears  to  be  a  definite 
trend  toward  more  complex  and  sophisticated  search  criteria  and  more  data, 
short  of  abstracts,  in  the  output.  These  are  inevitable — and,  on  the  whole, 
desirable — tendencies  for  an  application  which  has  a  relatively  short  history 
of mechanized processing. Progressively increasing complexity and sophistication
have typified virtually every application converted to electronic
processing  systems,  and  there  is  no  reason  to  think  that  document  retrieval 
is  any  different.  As  a  matter  of  fact,  it  is  doubtful  if  there  is  much 
justification  for  such  a  system  if  it  accomplishes  no  more  than  can  be  done, 
for  example,  with  "peek-a-boo"  cards. 

Many  of  these  ramifications  are  based  upon  data  either  already  contained, 
or  easily  included,  in  files  with  document-oriented  records.  Also,  they 
often  are  directed  toward  an  ultimate  real-time  operation  requiring  random 
access to file records and, at some point, remote query-display devices and
the  resultant  ability  of  the  requester  to  control  and  modify  the  handling 
of  his  query  as  a  part  of  its  processing. 

The  question  then  arises:  Is  the  list-organized  file  the  most  efficient 
method  of  storing  a  document  description  file  when  the  inverted  sequence  will 
not  meet  the  requirements  of  the  application?  After  careful  analysis  and 
evaluation  of  the  factors  and  implications  involved,  it  is  our  opinion  that 
the  answer  must  be  an  unqualified  "No."  If  a  list-organized  file  meets  the 
processing  requirements  of  a  document  retrieval  application,  then  a  conventional 
inverted  file  together  with  a  conventional  document-sequenced  file  constitutes 
a more efficient and preferable form of data storage.

This statement is not particularly difficult to substantiate. In fact,
the  suggested  organization  is  a  direct  and  immediate  product  of  analyzing  a 
list-organized  file  and  its  processing  implications.  Much  of  the  rather 
voluminous  literature  on  this  method  of  file  organization  seems  to  assume 
that it is a new methodology and is devoted to the design, use and manipulation
of lists. This approach has been made possible by adding large-capacity,
random-access storage devices to the electronic data processor, the complete
system removing the necessity for essentially sequential processing which
characterizes earlier types of data processors. Too little attention has
been  paid  to  what  a  list-organized  file  really  is  or  to  the  conditions  under 
which  it  may  be  the  optimum  form  for  storing  data  to  be  processed. 
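
The combination recommended above, a conventional inverted file used together with a conventional document-sequenced file, can be sketched as follows. This is only an illustration of the idea under assumed data layouts (dictionaries standing in for both files, an "and" request, hypothetical field names); it is not the report's design.

```python
def search_with_details(inverted, doc_file, terms):
    """'And' search on the inverted file, then fetch each hit's full record.

    The inverted file does the selection; the document-sequenced file supplies
    the additional output data (all index terms, author, date, and so on).
    """
    numbers = set(inverted.get(terms[0], []))
    for term in terms[1:]:
        numbers &= set(inverted.get(term, []))
    return [(number, doc_file[number]) for number in sorted(numbers)]
```

Only the documents actually selected are looked up in the document-sequenced file, so the richer output is obtained without following chains through records that fail the request.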

File  organization  and  design  always  have  been  predicated  upon  the  media 
available for data storage, the processing to be done upon the data and the
characteristics  of  the  "tools"  available  to  do  the  processing.  They  still 
are.  These  three  factors  are  heavily  interdependent.  The  principle  of  the 
list-organized  file  is  not  new,  but  its  manifestations  and  method  of  use 
differ,  of  course,  when  random  rather  than  basically  sequential  access  to 
records  becomes  possible. 

The  closest  counterparts  to  list-organized  files  are  found  in  those 
processed manually, where at least quasi-random access is possible. (Technically,
access to discs and drums also is quasi-random.) One of the oldest
is  the  list  of  synonyms  and  antonyms  given  for  many  words  in  a  dictionary  or 
thesaurus.  This  is  a  direct  counterpart;  the  cross-references  are  chain 
addresses  leading  to  other  file  records  having  something  in  common  with  the 
current  one.  Somewhat  less  obvious  is  the  widespread  use  of  colored  flags 
or  inserts  in  visible  record  or  vertical  files  to  identify  records  possessing 
a similar attribute value; moreover, one record can belong to several different
lists. In this case, the flag merely identifies a record having a
specific attribute and does not "chain" to the next record in the list.
This is a difference in technique arising because of the particular
characteristics of the file storage media and the manual processing against it.

It makes possible the processing of all records on a "list" on a quasi-random
basis and without the necessity of examining every entry in the file.
This is exactly the objective of a list-organized file in an electronic
computer  application.  The  use  of  edge-notched  cards  makes  possible  an 
approach  logically  the  same  as  that  described  above  and  adds  a  degree  of 
"mechanization"  to  finding  the  records  in  one  list. 

There  is  no  close  counterpart  to  list  organization  in  processing  systems 
based  upon  punch  cards  or  embossed  plates  as  the  file  storage  medium.  This 
arises  because  the  various  equipments  found  in  these  systems  handle  files 
purely on a sequential basis. Maintaining two decks of the same basic data
in different sequences is somewhat analogous, the filing keys of one deck
corresponding to lists into the other.

Records  in  a  list-organized  file  can  be  accessed  in  one  of  two  ways. 
First,  they  can  be  located  by  the  keys  upon  which  the  file  is  sequenced, 
each  record  being  in  a  specific  location  relative  to  all  others.  A  record 
may  be  found  either  by  sequential  search  of  the  file  or,  if  the  storage  and 
processing  system  permits,  on  a  random  access  basis.  Second,  records  having 
some attribute in common can be located by entering the list for that attribute
and, using the chain addresses or tags, finding each related record
in sequence. In practice, the technique is confined to systems permitting
essentially  random  access  to  any  desired  record. 

A record in a list-organized file contains two types of data fields.
First are those which pertain to the record itself; in a document file, these
are  the  index  terms,  author,  journal,  date  of  publication,  etc.,  which  describe 
a  given  document.  Second  are  the  chain  addresses,  each  of  which  links  the 
record  to  another  one  having  the  same  attribute  value  for  the  data  field 
linked.  These  chain  addresses  do  not  pertain  to  the  record  and  add  nothing 
to  the  information  contained  in  the  first  type  of  data  field.  Elimination 
of  all  chain  addresses  in  the  file  removes  absolutely  no  information;  all 
it  removes  is  one  method  of  entering  it. 

Assume  there  exists  a  list-organized  index  file  for  document  retrieval, 
with  document  numbers  as  chain  addresses.  The  entry  word  for  index  term  A 
(List  A)  gives  a  document  number  containing  A.  This  document  record  in  turn 
includes  a  chain  address  which  is  the  number  of  another  document  containing 
A;  and  so  on,  the  chain  address  of  the  last  document  in  List  A  containing  a 
unique  code  signifying  "end-of-list."  All  of  the  chain  addresses  linked 
from  the  entry  word  for  index  term  A  can  be  removed  from  the  file  and  set  up 
as a record for A. What is the nature of this record? Index term A followed
by all document numbers in which it appears. This is exactly the
record  for  index  term  A  in  an  inverted  file. 

The  process  of  removing  chain  addresses  from  the  list-organized  file 
and  creating  index  term  records  can  be  repeated  for  all  terms  in  the  entry 
word  table.  Upon  completion,  the  file  has  been  split  into  two  parts.  The 
index  terms  and  the  chain  addresses  constitute  a  normal  inverted  file.  The 
original  list-organized  file,  now  with  all  chain  addresses  eliminated,  is  a 
normal  document-sequenced  file.  Thus,  a  list-organized  document  retrieval 
file  is  a  direct  merger  of  the  conventional  document-sequenced  and  inverted 
files. Specifically, it is a document-sequenced file to which has been
added,  as  chain  addresses,  the  index  term  records  of  the  inverted  file. 
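
The separation described above can be illustrated with a minimal sketch. The data layout is an assumption made only for illustration: the entry word table is a dictionary from index term to first document number, each document record carries one chain address per index term, and `None` stands in for the "end-of-list" code. The function name `split_list_file` is hypothetical.

```python
END_OF_LIST = None  # assumed stand-in for the unique "end-of-list" code


def split_list_file(entry_table, records):
    """Split a list-organized file into an inverted file and a
    document-sequenced file.

    entry_table: {term: first document number on that term's list}
    records:     {doc_no: {term: next document number on the list}}
    """
    inverted = {}
    for term, doc_no in entry_table.items():
        postings = []
        while doc_no is not END_OF_LIST:
            postings.append(doc_no)
            doc_no = records[doc_no][term]  # follow the chain address
        inverted[term] = postings
    # What is left of each record once its chain addresses are removed.
    document_file = {doc_no: sorted(chains) for doc_no, chains in records.items()}
    return inverted, document_file


entry_table = {"A": 1, "B": 2}
records = {
    1: {"A": 3},
    2: {"B": 3},
    3: {"A": 4, "B": END_OF_LIST},
    4: {"A": END_OF_LIST},
}
inverted, document_file = split_list_file(entry_table, records)
```

In this small example the record produced for term A is simply the document numbers 1, 3 and 4, and no information is lost by removing the chain addresses from the document records.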

The  combination  of  an  inverted  and  document-sequenced  file  is  one 
alternate  way  of  setting  up  exactly  the  same  information  as  is  contained  in 
a  list-organized  file.  Because  an  inverted  file  record  not  only  corresponds 
to,  but  also  is,  a  list  of  chain  addresses,  it  can  be  used  exactly  as  they 
are used in a list-organized file. There is no mandatory reason for a
record in the file to contain the chain address of another one in the list.
The list of chain addresses can just as well be successive entries in a
separate record. The inverted and document-sequenced files permit carrying
out any type of processing possible with a list organization and, in addition,
enable  execution  of  operations  peculiar  to  the  inverted  sequence. 

This dual file has several advantages over the list-organized one:

File updating is simpler and faster. It is unnecessary to perform
the operations required to insert chain addresses within a single
file.


Search comparisons can be based upon index term operations in the
normal manner of the inverted file. This requires access to only
a few records and, usually, less computing than operating on lists.
The complete records for selected documents must be obtained from
the other file, but the total number of accesses almost always is
much less than with the list-organized file.

Searches can be conducted against documents in lists if considered
appropriate or faster. By incorporating suitable criteria, such
as presence in the request of an infrequently used index term, the
search program can be modified to select the type of search which
probably will be completed fastest or most efficiently.

Searches against index term lists transfer less than half as much
data  into  memory  as  the  conventional  list-organized  file,  because 
there  are  no  extraneous  chain  addresses  in  the  document-sequenced 
index  file.  The  chaining  itself  also  is  simpler  and  faster;  the 
next  document  number  is  in  a  known  location  in  the  inverted  file 
record  which  serves  as  entry,  rather  than  in  an  unknown  position 
in  the  record  currently  being  processed. 

If desired, searches can be a combination of the inverted and list-organized
approaches. That is, comparison of index term records
can continue until the number of documents so far meeting the
criteria is small, at which time document records can be scanned.
The intermediate group of document numbers serves as the entry list.

The possibility exists of organizing the document-sequenced file in
a  manner  which  will  reduce  the  access  time  to  its  records.  This 
arises  because  all  document  numbers  in  a  list,  or  selected  in 
processing  the  search  request,  are  known  before  any  of  them  are 
accessed.  If  the  records  are  suitably  organized  on  the  mass 
storage  device,  the  order  of  picking  up  records  can  be  chosen  to 
reduce  the  average  access  time  well  under  that  possible  with  a 
random  search. 
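
The combined search mentioned among these advantages might be sketched as follows. This is an illustration only, under assumed conditions: the request is a simple conjunction of index terms, the inverted file is a dictionary of document-number lists, and the switch-over point `threshold` is an arbitrary parameter. The name `hybrid_search` is hypothetical.

```python
def hybrid_search(terms, inverted, document_file, threshold=20):
    """Conjunctive search: compare index term records, least-used term
    first, until few candidates remain, then scan document records."""
    ordered = sorted(terms, key=lambda t: len(inverted[t]))
    candidates = set(inverted[ordered[0]])
    remaining = ordered[1:]
    while remaining and len(candidates) > threshold:
        candidates &= set(inverted[remaining.pop(0)])
    # The intermediate group of document numbers serves as the entry list.
    return sorted(d for d in candidates
                  if all(t in document_file[d] for t in remaining))


inverted = {
    "radar": [1, 2, 3, 5, 8],
    "antennas": [2, 3, 5, 7],
    "noise": [3, 5, 9],
}
# Document-sequenced file derived from the same (hypothetical) postings.
document_file = {}
for term, docs in inverted.items():
    for d in docs:
        document_file.setdefault(d, set()).add(term)

by_scanning = hybrid_search(["radar", "antennas", "noise"], inverted,
                            document_file, threshold=3)
by_comparison = hybrid_search(["radar", "antennas", "noise"], inverted,
                              document_file, threshold=0)
```

Both settings of the threshold select the same documents; only the division of work between the two files changes.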

The advantages and flexibility of the dual file technique indicate that it
is a preferable and more efficient approach than the conventional list-organized
file. Detailed analysis of the use of the dual file to process
lists has revealed only one disadvantage, considered to be of minor importance:
More memory must be allocated to hold the document numbers or other
identifying keys of the records in the list. In practice, long lists of
keys would be subdivided and several accesses made for the complete list.
At 50 keys per subdivision, the dual file approach requires 2% more record
accesses than does the list-organized file.
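
The 2% figure can be checked with a short calculation. The sketch assumes that processing a list of L records costs L record accesses in the list-organized file, and L accesses plus one access per block of keys in the dual file; the entry table access common to both is ignored. The function name is hypothetical.

```python
import math


def record_accesses(list_length, keys_per_block=50):
    """Return (list-organized accesses, dual file accesses) for one list."""
    list_file = list_length  # one access per chained record
    dual_file = list_length + math.ceil(list_length / keys_per_block)
    return list_file, dual_file


list_file, dual_file = record_accesses(1000)
overhead = (dual_file - list_file) / list_file  # fraction of extra accesses
```

For a list of 1,000 records the dual file needs 20 additional accesses, one for each block of 50 keys.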

Although this study of the implications of the list-organized file has
been  conducted  with  specific  reference  to  a  document  retrieval  application, 
the  conclusions  apply  to  many  other  applications  in  which  it  is  a  possible 
method  of  file  organization.  The  document  index  file  differs  from  most  other 
business data files in two significant respects. First, a document description
record once established in the file remains static and unchanged until
finally it is removed completely. Its field entries do not change and its
length does not vary by the addition and deletion of temporary "trailer" data.
Consequently, the lists to which it belongs remain fixed. Also, the lists
themselves  change  only  as  documents  are  added  to  or  deleted  from  the  file, 
not  from  processing  actions  on  records  already  in  the  file.  Changes  in 
field  entries  and  variations  in  "trailer"  data  are  normal  occurrences  in 
processing  most  other  files  and  the  lists  to  which  a  record  belongs  change, 
or can change, as a result of routine processing. Second, most of the
references  to  a  document  description  file  are  not  made  on  its  identifying 
and  sequencing  key  (document  number),  but  upon  an  attribute  value  (index 
term)  it  contains.  Again  this  is  atypical;  most  files  have  many  references 
based  upon  indexing  keys  and  relatively  fewer  upon  attribute  values. 

A parts file used for stock and inventory control purposes is a typical
example of a business-type data file. Some military activities, at least,
have established parts files in list-organized form and are processing
against them. Because many of the processing actions are routine orders for
or receipts of material, the file is established in part number (or stock
number) sequence and, in these common cases, access to a record is through
this filing key. However, a variety of other demands are placed upon the
file. Typical examples are: All parts used in a given equipment; all parts
obtainable from a specified supplier; all parts currently on order; all
parts with a cost of $1.50-$1.99; and all parts whose stock position is
below  their  established  low  limits.  Records  with  attributes  of  these  types 
obviously  can  be  chained  together  in  a  list-organized  file.  In  many  cases, 
the  required  output  of  processing  a  list  is  more  than  the  part  numbers  and 
access  to  all  or  a  portion  of  their  file  records  is  necessary. 

It  is  considered  that  a  list-organized  parts  file  is  less  efficient  and 
not preferable to a dual file. The latter is easier to maintain and update.
The routine processing actions transfer shorter records because there are no
superfluous chain addresses in the part number file itself. Many of the
lists are referenced at relatively infrequent intervals and the chain address
records might be stored more economically on a medium less expensive than a
mass storage device. It is conceded readily that the more efficient processing
and lesser computing time attainable with the dual file may be more
potential  than  realizable.  Access  time  to  records  may  dwarf  actual  data 
transfer  and  computing  time  and  this  may  make  any  time  saving  relatively 
insignificant.  There  is  no  practical  advantage  of  devising  a  more  efficient 
system  unless  productive  use  can  be  made  of  the  time  or  memory  saved,  or 
unless  comparable  results  can  be  achieved  with  a  smaller  amount  of  hardware. 

Nonetheless, it does not appear unreasonable to expect that the list-organized
file compete and be evaluated on its own merits against alternative
methods of data storage and processing. Tacit assumption of its efficiency
without recognizing its disadvantages can lead to using list organization in
applications where other approaches may result in markedly lower time or cost
of  processing.  The  list-organized  file  unquestionably  has  a  role  in  modern 
processing  systems.  It  is  highly  desirable  to  analyze  and  delineate  the 
conditions  under  which  it — and  other  forms  of  data  organization — can  be  used 
most  efficiently. 

4. Detail Design of Inverted and Document-Sequenced Files.


This  section  proposes  a  basic  method  of  approach  for  the  most  efficient 
detail index file design in a document retrieval application. It takes advantage
of data characteristics which can be used to minimize any one or
more of record access, data transfer or internal computing time. Although
the discussion assumes that files are maintained on mass storage devices,
the  inverted  file  design  also  can  be  used  advantageously  with  magnetic  tapes. 

Any  detailed  file  design  depends  heavily  upon  the  specifications  of  the 
storage  unit  and  its  computer  interface.  Because  these  vary  widely,  only 
the  general  approach  is  outlined.  Modifications  are  necessary  to  fit  the 
general  method  into  the  framework  of  a  specific  equipment  configuration. 

a.  Design  of  the  Inverted  File.  This  file  typically  is  set  up  in 
sequence  on  index  term  code  and  in  document  number  sequence  within  the  record 
for  each  term.  Many  search  requests  contain  several  fairly  common  index 
terms  with  several  hundreds  or  thousands  of  document  numbers  each.  Even  in 
libraries of modest size and average depth of indexing, a typical search may
involve  on  the  order  of  10,000  of  these,  each  of  which  must  be  transferred 
into  memory  and  enters  into  a  comparison  loop.  Quite  commonly,  a  small 
group  of,  say,  20  documents,  selected  on  the  basis  of  comparisons  so  far  made, 
is  matched  against  an  index  term  with  2,000  entries — frequently  followed  by 
other  high-usage  terms. 

If  the  index  term  record  with  2,000  entries  could  be  broken  into  200 
subsets,  for  example,  of  about  10  documents  each,  then  the  20  intermediate 
document  numbers  could  be  processed  by  accessing  not  over  20  of  these  subsets 
and making about 200 comparisons, eliminating 90% of the word transfers and
comparisons  otherwise  needed. 
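
The saving can be illustrated by counting the entries examined with and without subsets. The sketch below is hypothetical: the document numbers are invented, and the remainder of the document number modulo the number of subsets is used as an assumed rule for assigning a document to a subset.

```python
def entries_examined(candidates, postings, n_subsets=None):
    """Count index term entries transferred and compared when matching
    the candidate documents against one index term record."""
    if n_subsets is None:
        return len(postings)  # the whole record is processed
    subsets = {}
    for doc in postings:
        subsets.setdefault(doc % n_subsets, []).append(doc)
    touched = {doc % n_subsets for doc in candidates}
    return sum(len(subsets.get(s, [])) for s in touched)


postings = list(range(3, 6003, 3))  # 2,000 hypothetical document numbers
candidates = list(range(1, 21))     # 20 hypothetical intermediate documents

whole_record = entries_examined(candidates, postings)
subdivided = entries_examined(candidates, postings, n_subsets=200)
saving = 1 - subdivided / whole_record
```

With these figures 200 entries are examined instead of 2,000, a saving of 90%.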

Four  basic  system  requirements  should  be  met  if  an  inverted  file  is  to 
be  organized  successfully  in  this  manner: 

(1)  The  document  number  itself  must  determine  the  subset  to  which  it 
belongs.

(2)  Each  subset  should  contain  close  to  the  same  average  number  of 
documents. 

(3)  The  data  should  utilize  a  reasonably  high  percentage  of  the 
potential  capacity  of  the  storage  device. 

(4)  The  system  should  provide  for  increasing  the  number  of  subsets  as 
more  documents  are  added  to  the  index  term  record.  It  should  be 
self-organizing  in  the  sense  that  the  computer  program  includes 
criteria  permitting  automatic  adjustment  of  the  number  of  subsets 
as documents are added to or deleted from an index term.

In  addition,  a  fifth  requirement  exists  if  the  storage  device  cannot  handle 
variable-length  records;  it  is  closely  related  to  (3): 

(5) With variation in the number of entries, overflowing the capacity
of a subset is possible. The technique should permit determining
the subset size necessary to give statistical assurance that the
probability of overflow does not exceed some arbitrary low value.

These requirements indicate at once that some randomizing technique on
a document number is a possible means of determining its subset and, for all
documents in an index term record, giving a statistically-predictable distribution
of the number of entries in each subset. A simple randomizing scheme
is suggested. If document accession numbers are assigned in ascending
numerical sequence (this is the most common method) then the well-known
method of "terminal digit" filing effectively provides the desired randomizing.
For practical purposes, each of the numbers 0-9 is the terminal digit of
exactly 10% of the documents in a library. There is no reason to assume that
the  usage  of  an  index  term  is  in  any  way  related  to  or  dependent  upon  the 
terminal  digits;  i.e.,  there  is  a  probability  p  =  0.1  that  any  given  document 
using  the  term  has  an  accession  number  terminating  in  3,  or  any  other  decimal 
digit. If the term is used in N documents, the average number in each of the
ten subsets 0-9 is, of course, $pN$ and the standard deviation is
$\sigma = \sqrt{pqN}$, where $q = 1 - p$.
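
As a numerical illustration, the binomial mean and standard deviation can be computed for a term of any usage. The term with 2,000 documents is a hypothetical example and the function name is an assumption.

```python
import math


def subset_statistics(n_documents, n_subsets=10):
    """Mean and standard deviation of the number of entries per subset,
    assuming each document falls in a given subset with p = 1/n_subsets."""
    p = 1 / n_subsets
    q = 1 - p
    return p * n_documents, math.sqrt(p * q * n_documents)


mean, sigma = subset_statistics(2000)
```

For 2,000 documents the ten subsets average 200 entries each, with a standard deviation of about 13.4.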

Terminal digit studies have been made of a number of index terms in the
DDC sample and several analyses conducted on two 10% subsamples consisting
of document numbers ending in "2" or "9." None of these give any statistical
reason to doubt the randomness of index terms and the terminal digits of
documents. Creating subsets based upon terminal digits, then, is a statistically
valid approach which will distribute entries into them in approximately
equal number and with a predictable standard deviation from the
average. 

Terminal digit filing is not new in document retrieval. It has been
used for many years in manual systems, particularly those based upon the
well-known "Uniterm" concept. Here document numbers commonly are entered in
ten  columns,  based  upon  the  terminal  digit. 

Use  of  decimal  terminal  digits  to  determine  subsets  has  some  practical 
disadvantages.  If  the  number  of  documents  posted  to  an  index  term  increases 
to  the  point  where  more  subsets  are  desirable,  then  adding  the  next  higher 
terminal  digit  (the  "tens"  to  the  "units,"  for  example)  multiplies  their 
number  by  ten.  Also,  each  new  subset  has  only  one-tenth  as  many  entries, 
on  the  average.  Fewer  subsets  could  be  created  by  using  ranges  of  numbers; 
e.g., increasing 10 subsets to 20 is possible by grouping on terminal digits
00-04,  05-09,  etc.  However,  entry  to  the  proper  subset  is  somewhat  more 
complicated. 


A  preferable  approach  is  to  convert  the  decimal  document  number  to 
binary.  Each  bit  added  as  a  terminal  digit  doubles  the  number  of  sets  and 
halves their average number of entries. Many, but not all, electronic
processors have binary arithmetic capabilities and, possibly of even more
importance, sector addresses of many mass storage devices are in binary form.
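
A minimal sketch of subset selection on binary terminal digits follows. The function name and the sample document number are assumptions for illustration.

```python
def subset_of(doc_number, n_bits):
    """Subset selected by the n_bits low-order (terminal) bits."""
    return doc_number & ((1 << n_bits) - 1)


doc = 38402                 # hypothetical document number
four = subset_of(doc, 4)    # one of 16 subsets
five = subset_of(doc, 5)    # one of 32 subsets after adding one bit
```

Adding a bit splits subset s of the 16 into subsets s and s + 16 of the 32, so a document moves to one of only two possible new subsets.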

Suppose an index term record contains 16 subsets, determined by and
sequenced in order on the four binary terminal digits 0000 through 1111.
The location of the entire record on the mass storage device is determined
through the index term code. Desired subsets are specified by the terminal
bits of a document number and are in a known position relative to the first
subset 0000. Consequently any specified subset can be accessed readily,
provided the number of subsets is known. This may range from a single subset
for infrequently used index terms up to several thousand for the highly
common ones.

The most efficient technique so far found interprets the storage unit
address or addresses for a record in the general form nAs, where

n is a 4-bit prefix specifying the number of subsets $2^n$ (i.e.,
1, 2, 4, ..., 32,768);

A is the storage unit sector address of the first subset or group
of subsets for an index term; and

s is an increment to A such that A + s either (1) is the storage
unit address for subset s if there is one subset per sector,
or (2) specifies the sector and subset within sector if subsets
are grouped $2^z$ per sector.

nAs is stored as the entry table address for each index term in the inverted
file. Preferably, it is part of the mechanized thesaurus, where it is
readily available at the time the terms of the search request are validated.
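
One possible reading of this address form is sketched below. It is an assumption, not the report's specification: the class name `EntryAddress`, the field `z` for the number of subsets packed per sector, and the way s is split into sector and slot in case (2) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EntryAddress:
    n: int      # 4-bit prefix: the record has 2**n subsets
    A: int      # sector address of the first subset
    z: int = 0  # assumed: 2**z subsets are packed into each sector

    def locate(self, doc_number):
        """Return (sector address, subset slot within that sector)."""
        s = doc_number & ((1 << self.n) - 1)  # terminal bits give subset s
        return self.A + (s >> self.z), s & ((1 << self.z) - 1)


one_per_sector = EntryAddress(n=4, A=1000).locate(38402)
four_per_sector = EntryAddress(n=4, A=1000, z=2).locate(38407)
```

Under these assumptions the subset for a given document is reached from the entry word alone, without consulting any directory of subsets.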

Data transfer and comparison times are small when there are only a few
entries in the average subset. Minimizing these times conflicts with the
objective of utilizing a reasonable percentage of potential mass storage
capacity. For example, if an index term record with $N = 2^n$ entries is broken
into $2^n/4$ subsets with an average of four entries each, then

$$pN = 4, \qquad \sigma = \sqrt{pqN} \approx 2.$$

If the subset size is fixed at 8 words, the storage utilization is only 50%
and there is a probability p = 0.025 that a subset will overflow; that is, on the
average about one out of 40 subsets can be expected to have more than 8
entries. Somewhat better storage utilization might be realized with variable-length
sectors, but the fixed hardware requirements still are a fairly large
percentage, possibly 30-40%, of potential capacity.

Larger sectors result in better storage utilization but also increase
data transfer and computing times. If $N = 2^n$ and $2^n/16$ subsets are set up,
with an average of 16 entries each, then

$$pN = 16, \qquad \sigma = \sqrt{pqN} \approx 4.$$

Here a fixed subset size of 24 words yields 67% storage utilization with the
same p = 0.025 overflow probability. If variable-length sectors are permissible,
utilization of 90% or more should be possible.
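
The two sizing examples can be reproduced with a short sketch. It assumes that the fixed subset size is set at the average plus two standard deviations, and it uses the normal approximation for the overflow probability, which gives a value slightly below the 0.025 quoted. The record size of 2**16 entries and the function name are hypothetical.

```python
import math


def fixed_subset_size(n_entries, avg_per_subset, sigmas=2):
    """Return (subset size in words, storage utilization, overflow
    probability) for subsets sized at the average plus `sigmas` deviations."""
    n_subsets = n_entries // avg_per_subset
    p = 1 / n_subsets
    sigma = math.sqrt(p * (1 - p) * n_entries)
    size = round(avg_per_subset + sigmas * sigma)
    overflow = 0.5 * math.erfc(sigmas / math.sqrt(2))  # normal upper tail
    return size, avg_per_subset / size, overflow


small = fixed_subset_size(2**16, 4)
large = fixed_subset_size(2**16, 16)
```

The sketch gives 8 words at 50% utilization for an average of four entries, and 24 words at about 67% utilization for an average of 16.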

The  conflicting  objectives  of  small  subset  size  and  reasonably  high 
utilization of storage capacity are resolved on the basis of characteristics
of  the  equipment  to  be  used  and  administrative  determination  of  acceptable 
utilization. 

In  the  subdivided  file,  index  terms  are  grouped  by  number  of  subsets 
included  and  ordered  in  ascending  sequence  on  this  number.  The  first  group 
consists of terms appearing in a single document, a sector of n words containing
n terms. Index terms with 2, 3, 4, ... usages similarly are grouped
and packed several per sector; the sequence of document numbers within each
term is on terminal bits. This grouping continues until the number of usages
is enough to warrant creation of two subsets and splitting documents into two
groups based upon the terminal bit. With sectors of eight words, analysis
of the DDC sample indicates that the split can begin with terms having 5 to
6 usages. The first groups of terms then have this format:

1 Usage: 8 index terms per sector

2  Usages:  4  index  terms  per  sector 

3  Usages:  2  index  terms  per  sector 

4  Usages:  2  index  terms  per  sector 

For  these,  the  machine  address  carried  in  the  entry  table  in  the  form  aNs  is 
interpreted  thus: 

a:  Number  of  usages  of  index  term; 

N: Mass storage unit address of sector containing the term;

s:  Relative  number  of  term  record  within  sector. 

The  remainder  of  the  index  terms  are  established  initially  in  the  mini¬ 
mum  possible  number  of  sectors.  Thus,  still  using  8-word  sectors,  all  terms 
with 5-8 usages always can be stored in two sectors, based upon "0" or "1"
as terminal bits. Most terms with 9-12 usages also can be, as can some with
13-16, the probability of overflow increasing with the number of terms. If
overflow occurs, the number of sectors is doubled and the assignment of
document numbers made on the basis of two terminal bits: 00, 01, 10 and 11.
Thus, although the $2\sigma$ level is used to determine sector capacity and the
probability of overflow, the latter is not allowed to occur.

Most terms with up to 22-24 usages can be contained in four sectors,
as can some with 25-32. Whenever an overflow occurs, the number of sectors
again is doubled and another terminal bit added for sector identification.
This cycle is repeated until all index terms have been set up in the subdivided
inverted file. Each term is placed in the minimum number of sectors
for which no overflow occurs.


As new documents are added to the file, they are entered in the proper
sector for each index term. Whenever a sector for a term overflows, the
number of sectors is doubled and the record is transferred into the next higher
group on the mass storage device. Simultaneously, the machine address in the
index  term  entry  word  is  changed  to  the  new  location.  Thus  the  updating 
program  continuously  reorganizes  the  file  as  sector  subdivision  becomes 
necessary,  the  movement  always  being  toward  a  larger  number  of  sectors. 
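
The self-organizing behavior can be sketched for a single index term. This is a simplified illustration under stated assumptions: sectors are held in memory as lists, a sector holds 8 document numbers, and the physical relocation of the record and the change of machine address are not modeled. The class name `TermRecord` is hypothetical.

```python
class TermRecord:
    """Document numbers for one index term, spread over 2**bits sectors."""

    def __init__(self, sector_words=8):
        self.sector_words = sector_words
        self.bits = 0          # number of terminal bits in use
        self.sectors = [[]]    # one sector to begin with

    def add(self, doc_number):
        self.sectors[doc_number & ((1 << self.bits) - 1)].append(doc_number)
        # Overflow is not allowed to persist: double until every sector fits.
        while any(len(s) > self.sector_words for s in self.sectors):
            self._double()

    def _double(self):
        self.bits += 1
        docs = [d for s in self.sectors for d in s]
        self.sectors = [[] for _ in range(1 << self.bits)]
        for d in docs:
            self.sectors[d & ((1 << self.bits) - 1)].append(d)


record = TermRecord()
for doc_number in range(1, 21):  # 20 hypothetical document numbers
    record.add(doc_number)
sizes = [len(s) for s in record.sectors]
```

After the twenty additions the record has doubled twice and occupies four sectors of five entries each; the movement is always toward a larger number of sectors.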

The  basic  procedure  can  be  applied  for  any  desired  sector  size  and 
percentage  utilization  of  the  mass  storage  unit  capacity.  The  systematic 
breakdown  of  document  numbers  permits  searches  to  be  localized  within  specific 
sectors determined by the numbers of the documents which have met the criteria
up  to  the  current  stage  of  processing. 

It may be noted also that this technique of terminal digit filing can
reduce significantly the theoretical number of bits required to hold the
inverted file. When a sector contains only documents which have the same
s terminal bits, then they become redundant and need not be retained in the
stored record. For frequently used index terms, where s > 7 or 8, these
potential savings exceed 30% of the number of bits in a document number and,
for very common terms, may approximate 75%. Thus either more documents can
be  stored  in  a  sector  of  given  bit  capacity  or,  alternatively,  a  constant 
number  of  documents  stored  in  fewer  bits.  With  existing  equipments  and  mass 
storage  units,  this  potential  saving  probably  cannot  be  realized. 

b. Order of the Document-Sequenced File. If a document-sequenced file
is used in conjunction with an inverted file, access time to document records
can  be  minimized  if  they  are  grouped  on  terminal  digits.  Suppose,  for 
example,  that  the  tracks  on  a  disc  or  drum  are  broken  into  16  major  sectors, 
numbered  (in  binary)  from  0000  to  1111.  Each  document  record  is  stored  in 
the  major  sector  determined  by  the  four  terminal  bits  in  the  document  number. 

Because documents in the inverted file are sequenced and processed in
this same order, any list of document records to be accessed also is in this
order. Therefore up to 16 separate records can be transferred to the processor
memory during a single revolution of the drum or disc storage unit.
A random search for the same documents would be at an average rate of only two
per revolution. Ordering of the document records on terminal digits thus
eliminates  a  large  percentage  of  this  average  access  time. 
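
The effect of the ordering can be illustrated with a simplified rotation model. The model is an assumption: all requested records lie on one track, the head passes major sectors 0 through 15 once per revolution, and a record can be read only when its sector next passes the head. The document numbers and the function name are hypothetical.

```python
def revolutions(doc_numbers, n_bits=4):
    """Revolutions needed to read the records in the order given."""
    mask = (1 << n_bits) - 1
    revs, position = 1, 0
    for doc in doc_numbers:
        sector = doc & mask
        if sector < position:  # sector already passed: wait for next turn
            revs += 1
        position = sector + 1
    return revs


docs = [35, 18, 52, 7, 41, 64, 29, 10]  # arbitrary request order
unordered = revolutions(docs)
ordered = revolutions(sorted(docs, key=lambda d: d & 15))
```

In this example the eight records take four revolutions in arbitrary order and a single revolution when accessed in terminal bit order.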


It is concluded that the combination of an inverted and document-sequenced
file is a more efficient type of organization than the conventional
list-organized file. In addition, this dual file can be set up to reduce
both  the  processing  time  in  handling  a  search  request  and  the  time  required 
to  access  complete  document  records.  These  advantages  cannot  be  realized 
with  the  list-organized  file. 

II. INDEX TERM ASSOCIATIONS IN THE DDC SAMPLE


Creation  of  the  data  files  to  simulate  the  operation  of  the  Multi-List 
System  resulted  in  the  formation  of  all  pair  associations  among  the  599 
most  common  DDC  descriptors.  In  addition,  some  other  association  statistics 
have  been  developed  during  the  statistical  analysis  of  the  characteristics 
of this sample file. Some of the results are presented in this section.
The discussion is not a comprehensive study of pair associations and their
uses  in  a  document  retrieval  application. 

A.  ASSOCIATIONS  AMONG  THE  599  MOST  COMMON  DESCRIPTORS 

1. Occurrences of Pair Associations.

The 599 descriptors have 49,306 different pair combinations (27.6% of
the number possible) with 248,425 occurrences, an average of almost exactly
five each. 41% of the pairs occur only once and almost 80% five times or
less. Only 2% of the pairs appear 33 times or more, but they represent 25%
of total occurrences. Table A-2 (Appendix A) summarizes the distribution of
pairs by number of occurrences. The cumulative percentages of different
pairs  and  total  occurrences  also  are  shown  graphically  in  Chart  5. 
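
The two quantities tallied here, different pairs and total occurrences, can be illustrated with a minimal sketch. The four documents and their descriptors are invented for the example, and the function name is an assumption.

```python
from collections import Counter
from itertools import combinations


def pair_counts(documents):
    """Count occurrences of each unordered descriptor pair."""
    counts = Counter()
    for descriptors in documents:
        counts.update(combinations(sorted(set(descriptors)), 2))
    return counts


documents = [
    {"radar", "antennas", "noise"},
    {"radar", "antennas"},
    {"radar", "noise"},
    {"antennas", "propagation"},
]
counts = pair_counts(documents)
different_pairs = len(counts)
total_occurrences = sum(counts.values())
```

In this small example four of the six possible pairs occur, with six occurrences in all, an average of 1.5 per pair.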

The entire 38,402 document sample has about 209,000 different pairs
with 530,800 total occurrences. The 10.7% of descriptors comprising the
599 most frequently used generate 24% of the different pairs and 47% of the
occurrences. The remaining 89.3% of descriptors in the sample create about
160,000  different  pairs  with  282,400  total  occurrences,  an  average  of  only 
1.77  occurrences  per  pair.  Evidently,  in  the  sample  as  a  whole,  multiple 
occurrences  of  pairs  are  in  the  minority. 
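
The pair statistics quoted above (different pairs, total occurrences, and
average occurrences per pair) can be illustrated with a minimal sketch. It
assumes that each document description is available as a set of descriptor
names; the function name `pair_counts` and the four toy documents are
hypothetical and are not taken from the DDC sample.

```python
from collections import Counter
from itertools import combinations


def pair_counts(documents):
    """Count, for every unordered descriptor pair, the documents using both."""
    counts = Counter()
    for descriptors in documents:
        # Sorting gives every pair a single canonical (a, b) ordering.
        for pair in combinations(sorted(set(descriptors)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy file of four document descriptions.
docs = [
    {"Tests", "Design", "Radar Equipment"},
    {"Tests", "Design"},
    {"Tests", "Cargo Vehicles"},
    {"Design", "Polymers"},
]

counts = pair_counts(docs)
different_pairs = len(counts)              # number of different pairs
total_occurrences = sum(counts.values())   # total pair occurrences
average_per_pair = total_occurrences / different_pairs
```

Counting unordered combinations corresponds to the "pair combinations" of the
text; doubling both totals gives the corresponding pair permutations.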

2. Different Pairs and Occurrences Among the 599 Descriptors.

It  has  been  noted  that  the  number  of  different  pairs  decreases  with 
frequency  of  usage  among  the  599  most  common  DDC  descriptors.  This  question 
naturally  arises:  Is  there  any  close  correlation  between  the  number  of 
different  pairs  created  and  the  total  occurrences  of  those  pairs?  Table  A-3 
(Appendix  A)  shows  the  distribution  of  the  599  descriptors  against  these  two 
factors  as  coordinates.  Although  it  indicates  a  general  correlation,  the 
distribution  is  marked  by  wide  variations.  In  general,  descriptors  creating 
relatively  few  different  pairs  have  fewer  average  occurrences  per  pair  than 
those with many. However, for any one range of numbers of different pairs,
average  occurrences  for  different  descriptors  usually  vary  by  factors  of 
three  or  four  to  one. 

3. Association Factors for Pair Occurrences.


One  measure  of  association  is  p(B|A),  the  probability  of  occurrence  of 
descriptor  B  in  a  document,  given  that  it  contains  descriptor  A.  The  two 
permutations  of  a  pair  result  in  two  such  probabilities,  which  in  general 
are different: p(A|B) ≠ p(B|A).

Let

    f   = Number of occurrences of the pair, descriptor A with descriptor B;
    F_A = Frequency of usage of descriptor A alone;
    F_B = Frequency of usage of descriptor B alone.

Then

$$p(A|B) = \frac{f}{F_B} \quad \text{and} \quad p(B|A) = \frac{f}{F_A}.$$

Table A-4 (Appendix A) summarizes the values of p for the 98,612 pair
permutations among the 599 most common descriptors; almost two-thirds of them
have p < 0.015. Only 61 have p > 0.50. Of these, only two are permutations
of the same pair: "Peroxides" (A) and "Hydrogen Compounds" (B), for which
p(A|B) = 0.75 and p(B|A) = 0.64 (f = 56; F_A = 87 and F_B = 75). It may be
noted that in a hierarchal descriptor relationship, "Peroxide" would be
expected to fall into the class of "Hydrogen Compounds" and thus the
probability of occurrence of the latter, given "Peroxide" as being present in a
document, should be greater than the converse relationship. Actually, the
reverse condition exists. No meaningful conclusion is apparent.


For 11 of the remaining 59 permutations with p > 0.50, the converse
probability is between 0.20 and 0.50; the rest range downward to 18 for
which p < 0.02. Further, for 40 of these 59, the second descriptor (the one
whose probability of occurrence is given by p) is one of six very common
ones: Design (Rank 1); Tests (2); Guided Missiles (5); Radar Equipment (17);
Polymers (36); and Projectiles (37). For most of these, the converse
probability is quite low. This is to be expected; these common descriptors appear
in thousands of documents compared to a few hundreds at most for the other
member of the pair. For example, "Cargo Vehicles" (Rank 526) appears in 82
documents, 51 of which also contain "Tests." Thus, p(Tests|Cargo Vehicles)
= 0.62. "Tests," however, is used in 5,237 documents and therefore
p(Cargo Vehicles|Tests) = 0.01.
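
The two worked examples in this section can be reproduced with a short
sketch. It takes f as the number of co-occurrences of the pair and F_A, F_B
as the usage counts of the two descriptors, as defined above; the function
name `conditional_probabilities` is an assumption made for illustration.

```python
def conditional_probabilities(f, freq_a, freq_b):
    """Return (p(A|B), p(B|A)) for a pair occurring together f times."""
    return f / freq_b, f / freq_a


# "Peroxides" (A, 87 usages) with "Hydrogen Compounds" (B, 75 usages), f = 56.
p_perox_given_hydro, p_hydro_given_perox = conditional_probabilities(56, 87, 75)

# "Cargo Vehicles" (A, 82 usages) with "Tests" (B, 5,237 usages), f = 51.
p_cargo_given_tests, p_tests_given_cargo = conditional_probabilities(51, 82, 5237)
```

The asymmetry is visible at once: the same 51 co-occurrences are a large
share of the 82 uses of the rarer term and a negligible share of the 5,237
uses of the common one.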


4. Associations of 50 Most Common Descriptors.

Table  A-5  (Appendix  A)  details  the  number  of  pairs  and  total  occurrences 
for  each  of  the  50  most  common  descriptors.  Associations  are  broken  down 
into  those  with  the  599  most  common  and  with  the  remaining  4,941  descriptors. 
This table again makes it apparent that several occurrences of a pair are the
exception, even when one member is common. (The 50 most common descriptors
are used in 443 or more documents; the 4,941 less common have 71 or fewer
usages.)


B.  DESCRIPTOR  ASSOCIATIONS  AMONG  DDC  GROUPS  AND  FIELDS  OF  INTEREST 


The summaries described here are based upon the 292 groups and 19 fields
of interest described in the ASTIA thesaurus, 1960 edition, applicable during
the time period covered by the sample. There now are 33 fields.

1. Most Common Descriptors Summarized by Field.

Table  A-6  (Appendix  A)  summarizes  the  599  most  common  descriptors  into 
ASTIA  fields,  together  with  the  number  of  pair  permutations  having  one  or 
both  members  in  the  field  and  their  total  occurrences.  Some  fields  and  groups 
are  richly  represented;  others  have  few  descriptors  among  these  599.  This 
variation  reflects  the  types  of  documents  in  the  sample  and,  by  extension, 
the  relative  distribution  of  document  acquisitions  by  fields  of  interest. 
Although  the  thesaurus  must  provide  for  adequate  indexing  of  documents  in  all 
fields  of  interest,  descriptor  usage  is  a  function  of  the  types  and  numbers 
of  documents  received.  Descriptors  in  fields  represented  by  many  documents 
not  only  have  many  chances  to  be  used,  but  also  many  chances  to  create 
different  pairs  and  multiple  occurrences  of  one  pair. 

2.  Associations  Classified  by  Group  and  Field  of  Interest. 

It  is  desirable  to  test  the  hypothesis  that  the  DDC  thesaurus  has  a 
hierarchal  structure  which  is  reflected  in  descriptor  associations  and  which 
can  be  used  as  a  tool  in  formulating  search  requests. 

For  this  purpose,  the  pair  associations  formed  by  the  descriptors  in 
each  of  the  155  groups  have  been  summarized  and  classified  by  all  of  the 
other  groups  to  which  the  second  descriptor  of  each  pair  has  been  assigned. 

Each group, A, is represented by a single summary page which lists every
other group, B_i, having descriptors associated with those in A. Three
quantities are accumulated for each of the B_i entries: (1) number of different
descriptors in group A entering into associations with those in group B_i;
(2) number of different pairs formed; and (3) total occurrences of these
pairs. In addition, the last two quantities are totalled for each of the
19 major fields of interest into which the 292 groups are combined. Table
A-7 (Appendix A) shows a typical page of this summary; it is for group 145
(Materials) in field 10 (Materials and Metals).

55 of the groups, or 35%, have only one descriptor each and another 30
have two. 13, or about 8%, include ten or more descriptors. The number of
other groups with which associations occur averages 93.5, about 60% of the
number possible. The range is from 36 (Drugs and Biologicals, group 072,
with one descriptor) to the maximum of 154 for General Concepts, group 292,
with 15 descriptors. There is a definite correlation between the number of
descriptors in a group and the number of other groups involved in associations.
The 55 groups with only one descriptor each form associations with an average
of 66 other groups; the 13 with ten or more descriptors average 141.5 each.

Table A-8 (Appendix A) summarizes, by fields, the frequencies of pair
associations, together with the number of occurrences for which both
descriptors are in the same group or the same field-of-interest. Co-usage of
two descriptors in one group represents only 1,788, or 1.0%, of the number
of different pairs and 4.5% of total occurrences. Although seemingly low,
this is over 75% of the possible number of intragroup pairs. For most groups
with 2-4 descriptors, all possible pairs actually exist, the percentage
occurring decreasing slowly (and not uniformly) as the number of descriptors
in the group increases. Only four of the 100 groups with two or more
descriptors have no intragroup pairs; all four have either two or three
descriptors. Thus if two of these 599 most common descriptors are in the same
group, there is a high probability that they will be associated in use.
Furthermore,  they  are  likely  to  occur  times  as  often  as  other  pairs. 
However, intragroup associations are only a relatively insignificant part of
all of them.

Although they account for only 11% of the number of different pairs,
51% of the intrafield associations which can exist do occur in the sample:
9,582 of a possible 18,558. Actually, 17 of the 19 fields exceed this
percentage and 9 have more than 70% of the possible pairs. The over-all average
is heavily weighted by the 133 descriptors in Physics and Mathematics; only
3,673 (42%) of the 8,778 possible do exist and 47% of the potential number
is concentrated in this one field.

Interfield associations predominate among these 599 descriptors. Table
A-9A (Appendix A) summarizes these interfield usages by numbers of different
pairs and Table A-9B by numbers of occurrences. (Entries in the body of
these tables are symmetrical about the underlined diagonal.) All possible
combinations exist except for Bio-Sciences with Civil Engineering or
Propulsion Systems. As might be expected, all fields form many associations with
descriptors in Applied Research, Miscellaneous Arts & Sciences and Physics
& Mathematics. Table A-9C shows the number of associations actually existing
as a percentage of the number possible.

The foregoing comments can be summarized briefly. Among these common
descriptors, there is a 0.25 probability that any two taken at random will
be associated in use. If the two are in the same DDC field, the probability
of co-occurrence is doubled; if in the same group, tripled. On the average,
almost 90% of the different pairs and 85% of total occurrences involve
descriptors in two fields. Pairs within the same group have a markedly higher
average number of occurrences than other pairs; those within one field have
a somewhat higher average. All of these data have been based upon an
analysis of the 599 most common descriptors in a file of 38,402 documents, each
descriptor occurring in 72 or more of them.

Whether or not these results indicate any tendency toward a "hierarchal
structure" in descriptor associations is somewhat uncertain. Although
intragroup and intrafield associations of descriptors are much more probable than
the others, and occur more often, it seems questionable to base a hierarchy
on 10% or less of different pairs and 15%, at most, of occurrences.
Interfield associations of descriptors are predominant. Furthermore, frequently
occurring pairs are the exception: 41% occur only once, 79% five times or
less, and half of all occurrences are accounted for by pairs appearing 12
times or less.

3. Pair Associations Among All Descriptors.

Table A-10 (Appendix A) summarizes pair occurrences among all descriptors
in the sample, classified by the number of usages of descriptors. The 5,540
descriptors in the 38,402 documents form 418,400 pair permutations with
1,061,600 occurrences, an average of only 2.5 each. It is estimated that
over 80% of the pairs in the sample occur only once or twice each.

C.  COMMENTS  ON  STATISTICAL  ASSOCIATION  MEASURES 


Many of the association measures which have been proposed are based
upon the conventional 2-way contingency table, or can be expressed in terms
of its cell entries:

                I            II                 Total

    1           f            B - f              B

    2           A - f        N - A - B + f      N - B

    Total       A            N - A              N

where

A: Number of documents described by an index term D_A.

B: Number of documents described by an index term D_B.

f: Number of documents described by both index terms D_A and D_B.

N: Number of documents in the library.
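
The contingency table can be built directly from the four quantities f, A, B
and N. The following is a minimal sketch; the function name
`contingency_table` and the choice of a nested list are assumptions made for
illustration. The figures used are those quoted earlier for "Peroxides"
(A = 87) and "Hydrogen Compounds" (B = 75), with f = 56 and N = 38,402.

```python
def contingency_table(f, a, b, n):
    """Return the 2-way table, including marginal totals, as three rows.

    Columns are I (documents with D_A), II (documents without D_A), Total;
    rows are 1 (documents with D_B), 2 (documents without D_B), Total.
    """
    if not 0 <= f <= min(a, b):
        raise ValueError("f must lie between 0 and the lesser of A and B")
    neither = n - a - b + f
    if neither < 0:
        raise ValueError("A + B - f cannot exceed N")
    return [
        [f, b - f, b],
        [a - f, neither, n - b],
        [a, n - a, n],
    ]


table = contingency_table(56, 87, 75, 38_402)
```

Even for this unusually strong pair, the fourth cell (documents containing
neither term) dwarfs the other three, which is the situation discussed in the
text for nearly all index terms.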

Occasionally, it is desirable to consider the total occurrences of all index
terms, both singly and in pairs. This notation is used:

A_i: Number of documents described by D_i.

f_{i,j}: Number of documents described by both D_i and D_j.

c: Number of different index terms, D_i, used in a document.

$\sum_{i} A_i$: Total number of occurrences of all index terms.

$\sum_{i=1} \sum_{j=1} f_{i,j}$: Total number of occurrences of all pairs
formed by all index terms D_i D_j.

In the DDC sample, the number of occurrences of any random index term
D_i usually is very small compared with N (= 38,402). Of the 5,540 different
descriptors represented, only 50 occur over 400 times; i.e., for 99% of the
descriptors, A_i < 0.01N. For 80% of them, A_i < 0.001N. Because f_{i,j}
cannot exceed the lesser of A_i and B_j, it follows that, in most cases, the
magnitudes of f, A - f and B - f in the contingency table are small compared
with the fourth entry, N - A - B + f. Although comparable data for other
applications have not been seen, it appears probable that most of them will
be somewhat similar in nature to that of DDC, possibly with smaller
percentages of index terms at the 0.01N and 0.001N levels: 95-98% with
A_i < 0.01N and 40-75% with A_i < 0.001N.


1. Association Measures.

Among the first measures of association proposed were three by Maron
and Kuhns [10], who developed them as part of a more general statistical
approach to the problem of document retrieval. The first is the conditional
probability that, if the term D_B is assigned to a document, then D_A also is:

$$P(D_A|D_B) = \frac{f}{B} \tag{1}$$

The second is the inverse conditional probability of (1); i.e., if D_A
is known to be assigned to a document, then D_B also is:

$$P(D_B|D_A) = \frac{f}{A} \tag{2}$$

This actually is not a second relationship, but the first with D_A and D_B
interchanged in meaning. However, its differentiation is desirable, because
in general P(D_A|D_B) ≠ P(D_B|D_A) and, in fact, the two are equal only if
A = B, which is not often the case.

P(D_A|D_B) ranges in value from zero (f = 0) to 1 (f = B) and is easy to
calculate. As a useful measure of association, it has been considered
deficient by several investigators because it does not take into account the
number of co-occurrences of D_A and D_B which are to be expected on the basis
of chance. This evidently is a function of the magnitudes not only of A and
B, but also of N, which does not appear in (1) and (2). To overcome this
objection, Maron and Kuhns introduce a third measure, a contingency estimate,
which removes from f the magnitude to be expected, on the basis of chance,
given the actual values of A, B, and N:

$$S = f - \frac{AB}{N}$$

They then introduce an arbitrary coefficient of association, based upon S,
ranging in value from -1 to +1 and equal to zero when S = 0. This coefficient
is of the form

$$Q(D_A, D_B) = \frac{fN - AB}{f(N - A - B + f) + (A - f)(B - f)}$$

Stiles [11] also starts with the contingency table given above and, using
the Yates correction for a 2x2 table with one degree of freedom, adopts as
an "association factor" (A.F.) the base 10 logarithm of the expression for
χ²:

$$\text{A.F.} = \log_{10} \chi^2 = \log_{10} \frac{\left(|fN - AB| - \frac{N}{2}\right)^2 N}{AB(N - A)(N - B)} \tag{4}$$

In use, all co-occurrences having A.F. > 1 are retained as having potential
usefulness, others being discarded. At this point, there is a probability
on the order of 0.001 that an observed frequency of co-occurrence, f, is
due to chance factors for the given values of A, B, and N. Association
factors of 5 or more (χ² > 100,000) are not unusual in libraries of more than
100,000 documents.

Doyle [12] introduces another measure to indicate strength of
association:


(5) 


This has a wide range of values and, because frequently N » AB, may be
quite large for small f. It is, of course, zero when f = 0, i.e., when the
pair does not exist in any document.

The expressions (1) to (5) all are based upon the total population of
indexed documents, N, which is divided into four subsets:

(1) Those containing the term D_A,

(2) Those containing D_B,

(3) Those containing both D_A and D_B,

(4) Those containing neither term.

They include normalizing procedures to adjust the sizes of the group f to
remove the effect that may result from the tendency of D_A and D_B, considered
separately, to occur frequently as index terms. Such normalization is
required because, the more frequently an index term occurs, the more frequently
it is apt to be used with some other term simply on a chance basis.



2. Usefulness of Associations Which Occur Only a Few Times.

In most cases, it is extremely dubious if any particular significance
can be attached to a unique index term "association." This is self-evident
if one of the terms, A, appears in only one document. If it contains c terms,
A must form c-1 single-occurrence pairs, regardless of the "statistical
odds" against any particular pair AB. Similarly, terms used in only a few
documents tend to form mostly unique pairs: over 95% in the DDC sample for
A = 2 to 5. Although the percentage of multiple occurrences increases with
A and B, even the 599 most common have 40% of their different pairs unique.
Theoretically, a frequency distribution of expected pair occurrences, based
on chance, could be calculated for each of them. However, even if the
number of unique pairs for a given A differs significantly from the chance
expectation, in many cases there is no way of determining whether or not a
specific pair AB represents a significant association.

The cases where f is small (say 2 to 5) may require more detailed
analysis than they have so far received. If A also is small, then P(D_B|D_A)
may be meaningful. For example, f = 2 and A = 3 give some reason to believe
that A, which co-occurs with B in two of its three uses, may have a
significant association with B. The degree of confidence is strengthened if the
indexing of additional documents creates such ratios as 4/6 or 5/7 and
decreased if they become, say, 2/5 or 3/8. It is possible, but considered
unlikely, that the limited amount of information in a single occurrence
increases sharply, simply by adding another occurrence. In any event, it
appears as if some attention should be paid to these occurrences, with the
specific objective of ascertaining parametric criteria for distinguishing
the "meaningful" from "nonmeaningful."

However, if A and B are relatively large, then small values of f may
indicate a significant "negative association" between them. The theoretical
frequency of co-occurrence, assuming independence, is

$$f_t = \frac{AB}{N}$$

and, if this value > 5, the difference between observed and theoretical
frequencies can be tested by standard statistical methods for significance.
In the DDC sample, for example, the two high-usage terms "Temperature" (6th
ranked with 1,409 occurrences) and "Countermeasures" (29th, with 846) occur
together in only one document. The difference between the theoretical
frequency of 33 co-occurrences and the one actually observed has a very small
probability of being explainable by chance and it is concluded that the two
terms have a significant negative association. [In equation (4) of Section
1, this occurs when fN - AB is negative.] In general, a significant negative
association can be established statistically only when AB > 5N, or a little
less if the case f = 0 (no co-occurrences) is considered. Because at least
one of the terms must be used in √(5N) or more documents, only a small
percentage of possible or actual pairs are susceptible to this determination.
In the DDC sample, only 50 terms occur more than √(5N) ≈ 438 times; only
17,900 pairs have AB > 5N.

3. The Conditional Probability P(D_A|D_B) = f/B.

This is easy to calculate and interpret: If a given document contains
the term D_B, it is the probability that it also contains D_A. However, its
significance is difficult to measure. f/B is independent of the actual
magnitudes of f and B; it does not involve A at all, except that by
definition A ≥ f; and without introducing N, it cannot be determined whether or
not f represents a significant association.

Despite these deficiencies, the conditional probability has one feature
considered definitely desirable: It is a measure of the association in the
direction required by the search request. For most pairs, P(D_A|D_B) and
P(D_B|D_A) not only differ, but differ markedly. Whether or not a term should
be added to the search request can well depend upon which one already is in
it. If, for example, P(D_A|D_B) = 5/6 and P(D_B|D_A) = 5/200, it is not at all
obvious that identical actions should be taken regardless of which of the
two terms is in the original request. Additionally, statistical tests for
the significance of f do not depend upon the individual values of A and B,
but only upon their product. The conditional probabilities definitely
increase our knowledge of the nature of the association.

The frequency distribution of Table A-4 (Appendix A) gives P(D_A|D_B),
rounded to two decimal places, for all pairs among the 599 most frequently
used DDC index terms. Note that entries for f/B = .01 include the 40,436
pairs (D_i D_j ≠ D_j D_i) occurring only once. (The maximum value of 1/B is
1/72, which rounds to .01.) This distribution probably is roughly typical when
both D_A and D_B have fairly high usage. It would be quite different if all
index terms were included. For example, index terms used in from 1-10
documents form a quite large number of different pairs for which f = 1 or 2,
resulting in pronounced peaks at the values 1/B, B = 1 to 10.


4. Association Factors and Coefficients.


Equations (3) and (4) of Section 1 are typical of association
coefficients designed to indicate the probability that an observed frequency of
co-occurrence will differ from the theoretical frequency by purely chance
factors. The basic approach uses the 2x2 contingency table, whose cell
entries can be determined readily from the known values of f, A, B, and N.
The hypothesis that A and B are independent is tested by the χ² statistic.
Because Stiles' Association Factor is the logarithm of a computational
approximation to χ², it is used here for illustrative purposes:

$$\text{A.F.} = \log_{10} \chi^2, \qquad \chi^2 \approx \frac{(|fN - AB| - 0.5N)^2 N}{AB(N - A)(N - B)} \tag{6}$$

When fN > AB, which almost always is the case in document descriptions, the
observed frequency is greater than the theoretical frequency. If f and N
are fixed, and A and B are relatively small compared with N, then χ² (or A.F.)
varies inversely as the magnitude of the product AB. As AB decreases to its
minimum possible value f² (A = B = f), χ² increases to its maximum value. An
idea of the range of values of A and B for which χ² will exceed any desired
value thus can be obtained once the product AB is known. The tables on the
next page give these values for χ² > 10, 100 and 1,000 (corresponding to
A.F. > 1, 2, and 3) and for several values of f and N. The three tables at
the left give the maximum value of B, which occurs when A = f. For example,
if N = 50,000 and A = f = 5, then χ² > 10 for all B < 13,563; χ² > 100 for
B ≤ 1,929; and χ² > 1,000 for B < 201. The right-hand three tables give the
maximum value of the product AB; in the example above, χ² > 10 for all
AB < 67,815.
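
The worked example can be checked numerically. The sketch below evaluates
the expression for χ² given in equation (6) and searches upward from B = f
for the largest B at which χ² still exceeds a chosen value. The function
names are assumptions, and the linear search relies on χ² decreasing as B
grows, which holds in the region fN > AB considered here.

```python
import math


def chi_square(f, a, b, n):
    """Yates-corrected chi-square for the 2x2 table, as in equation (6)."""
    numerator = (abs(f * n - a * b) - 0.5 * n) ** 2 * n
    denominator = a * b * (n - a) * (n - b)
    return numerator / denominator


def association_factor(f, a, b, n):
    """Stiles' association factor: base 10 logarithm of chi-square."""
    return math.log10(chi_square(f, a, b, n))


def largest_b(f, a, n, chi_min):
    """Largest B >= f for which chi-square still exceeds chi_min.

    Assumes chi_square(f, a, f, n) > chi_min, so that the search starts
    inside the region of interest.
    """
    b = f
    while b + 1 < n and chi_square(f, a, b + 1, n) > chi_min:
        b += 1
    return b


# N = 50,000 and A = f = 5, as in the example in the text.
b_limit_10 = largest_b(5, 5, 50_000, 10)   # 13562, i.e. all B < 13,563
ab_limit_10 = 5 * (b_limit_10 + 1)         # 67815, the quoted AB bound
```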


If AB is considerably less than N (say 0.1N or less), χ² is given
approximately by

$$\chi^2 \approx \frac{\left(f - \frac{1}{2}\right)^2 N}{AB} \tag{7}$$

It is evident at once that the value of χ² is extremely sensitive to and
increases rapidly with f, particularly when f is small.


The A.F. proposed by Stiles compresses these wide variations by using
the logarithm of χ² rather than χ² itself. A.F. = 1.00 when χ² = 10, for
example, and A.F. = 3 for χ² = 1000.
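
The sensitivity to f, and the compression obtained by taking logarithms, can
be seen in a small sketch comparing the Yates-corrected expression with the
approximation (f - 1/2)²N/AB. The function names and the values A = 20,
B = 50 used for the sensitivity table are hypothetical.

```python
import math


def chi_square_exact(f, a, b, n):
    """Yates-corrected chi-square for the 2x2 table."""
    return (abs(f * n - a * b) - 0.5 * n) ** 2 * n / (a * b * (n - a) * (n - b))


def chi_square_approx(f, a, b, n):
    """Approximation (f - 1/2)^2 * N / (A * B), for AB well below N."""
    return (f - 0.5) ** 2 * n / (a * b)


def association_factor(chi_square):
    """Base 10 logarithm of chi-square."""
    return math.log10(chi_square)


n = 50_000
exact = chi_square_exact(5, 5, 200, n)     # AB = 1,000, well under 0.1N
approx = chi_square_approx(5, 5, 200, n)

# Holding A, B and N fixed, chi-square grows as (f - 1/2)^2.
by_f = {f: chi_square_approx(f, 20, 50, n) for f in (1, 2, 3, 5)}
factors = {f: association_factor(value) for f, value in by_f.items()}
```

Going from one co-occurrence to two multiplies the approximate χ² by nine,
while the association factor rises by less than one unit.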


The appropriateness of using contingency tables, and specifically the
2x2, and the χ² statistic is questionable. Equation (6) approximates the χ²
distribution only when the theoretical frequencies in each cell are reasonable
in magnitude and in practice should not be used unless each such cell entry
is at least 5. In the case of index term associations, the theoretical
frequencies A, B, and N are taken to be the same as those observed, N always
being quite large. Many of the A and B are less than 5, the exact percentage
varying with library size, number of different indexing terms, depth of
indexing, etc. However, the theoretical frequency of co-occurrences,

$$f_t = \frac{AB}{N},$$



practically never is as great as 5. It will not be unless AB > 5N and
typically is much less than 1.0. The "χ²" calculated in these cases is
difficult to interpret and its meaning becomes progressively more nebulous
as its magnitude increases. In particular, there is no good reason to
conclude that large differences in the magnitude of two χ²'s actually represent
any real difference in the "degree of association" of two pairs of index
terms, or that the two χ² values can be used as measures of the degrees of
association. Consequently, the ordering into sequence of all terms associated
in use with a given D_A, based upon the value of χ², does not give any assurance
that the resultant order of the D_B is even approximately correct. The
uncertainty probably is greatest for the larger values of χ². Because these
values, averaged and/or normalized, ultimately become document "relevance
numbers," a similar uncertainty exists in them.

It must be observed that the use of association measures based upon the
2x2 contingency table has produced apparently useful results, even though the
approach itself is open to theoretical question. Usefulness of results, of
course, is the ultimate test of any measure of association and the χ²
statistic may well be useful. Certainly one objective of a retrieval system
can be to order documents according to their probable relevance to the
request and this ordering possibly need be only approximately correct. As
a matter of note, so long as the determination of "degree of relevancy" is
subjective and not assigned an empiric value, the evaluation of the "relevance
numbers" by which documents are ordered is itself subjective. The important
factor may not be the relevance number itself, but the fact that the documents
most likely to be pertinent are grouped roughly at the top of the list.

D. THESAURUS STRUCTURE, INDEXING STANDARDS AND ASSOCIATION FACTORS

The study of association factors and their possible uses involves
consideration of many factors; of great importance, and too often neglected in
analyses, is the data base of document descriptions from which the association
factors are calculated; they can be no better than the index terms assigned
to documents. This section discusses the large class of associations implicit
in the organization and structure of the thesaurus and suggests a general
method in which they can be handled efficiently.

1. Hierarchal Nature of a Thesaurus.

The index terms in the thesaurus form a hierarchy, or tree-like
structure, branching out from a relatively few major divisions at the top through
a varying number of branch points or nodes down to the most detailed terms
at the bottom of the inverted tree. The number of levels or branches varies
in different parts of the tree, as does the number of terms at any one level.

Once  the  tree  structure  has  been  established  and  the  relationships  of 
index terms defined by "links" from one node to that above or those below,
it  is  possible  to  enter  the  tree  at  any  index  term  and  traverse  it  in  either 
direction  using  only  the  link  data.  This  can  be  done  by  a  computer,  provided 
the  linkage  data  are  included  in  the  thesaurus  made  available  to  it.  This 
possibility  has  several  important  implications  on  the  overall  design  and 
operation  of  the  retrieval  system,  in  addition  to  its  effects  on  index  term 
associations. 

a. Implicit Index Term Associations. The thesaurus tree immediately
specifies the members of a set of significant index term associations. A
term $D_n$ at level n always is a subset of the next higher term $D_m$ at
level m. Furthermore, $p(D_m \mid D_n) = 1$. Conversely, $D_m$ always
includes as subsets all the $D_n$ linked directly to it, but usually
$p(D_n \mid D_m) < 1$. In a similar manner, the term $D_n$ is bidirectionally
linked through $D_m$ with index terms at still higher levels. All of these
index term associations derived from the thesaurus tree are significant,
whether or not any particular pair meets tests for statistical significance.
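
The asymmetry between a term and the next higher term can be checked on a
file of document descriptions. The following is a minimal sketch in Python,
not part of the original study: the four document descriptions and the
assumption that "Steel" lies directly below "Metals" in the tree are
hypothetical, chosen only to show that the probability of the higher term
given the lower one is 1 when indexing is complete, while the reverse
probability usually is smaller.

```python
from typing import List, Set


def conditional(docs: List[Set[str]], given: str, target: str) -> float:
    """Fraction of the documents indexed with `given` that also carry `target`."""
    with_given = [d for d in docs if given in d]
    if not with_given:
        raise ValueError(f"no document is indexed with {given!r}")
    return sum(target in d for d in with_given) / len(with_given)


# Hypothetical descriptions; "Steel" is assumed to be a subset of "Metals".
DOCS: List[Set[str]] = [
    {"Steel", "Metals"},
    {"Steel", "Metals"},
    {"Metals"},
    {"Metals", "Plastics"},
]

p_higher_given_lower = conditional(DOCS, given="Steel", target="Metals")
p_lower_given_higher = conditional(DOCS, given="Metals", target="Steel")
print(p_higher_given_lower, p_lower_given_higher)
```

If the higher term had been omitted from one of the "Steel" documents, the
first value would fall below 1, which is the kind of incompleteness
discussed under current indexing practices.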

b. Lowest Level Indexing. Only the lowest level or most detailed term
applicable in any one branch need be assigned to a document. All higher
level terms of more general meaning can be assigned automatically. With
manual indexing, this not only saves some indexing effort and input data
preparation but also, and more important, assures that these higher-level
terms are assigned.
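
Automatic assignment of the higher-level terms can be sketched as follows.
This is an illustration under stated assumptions, not the procedure of any
system examined: the link data are held as a table giving the next higher
term for each term, and the small tree fragment (including the placement of
"Stainless Steel" under "Steel") is hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical link data: each term points to the next higher term;
# None marks a major division at the top of the tree.
PARENT: Dict[str, Optional[str]] = {
    "Materials": None,
    "Metals": "Materials",
    "Steel": "Metals",
    "Stainless Steel": "Steel",
    "Plastics": "Materials",
}


def ancestors(term: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the higher-level terms above `term`, nearest first."""
    chain: List[str] = []
    node = parent[term]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def complete_description(assigned: Iterable[str],
                         parent: Dict[str, Optional[str]]) -> Set[str]:
    """Add every higher-level term implied by the manually assigned terms."""
    full: Set[str] = set()
    for term in assigned:
        full.add(term)
        full.update(ancestors(term, parent))
    return full


print(ancestors("Stainless Steel", PARENT))
print(sorted(complete_description(["Stainless Steel", "Plastics"], PARENT)))
```

The indexer supplies only the two lowest-level terms; the three higher-level
terms are added from the link data alone.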

c. Current Indexing Practices and Factor Association Studies. Automatic
assignment of tree-related terms assures a degree of uniformity and
completeness missing in every operative document retrieval system which has
been examined. For a number of perfectly normal reasons, the assignment of
tree-related terms to documents is quite variable. Sometimes several levels
of terms in one branch are assigned; at other times, only the (presumably)
lowest level term applicable. Spot-checks of document descriptions in several
applications against the thesaurus indicate that this variability is
commonplace.

Although these spot-checks are fairly few in number, they all tend to
indicate that existing files of document descriptions are missing an unknown,
but possibly quite large, number of implicit term associations. Consequently,
association factor studies based upon an existing file have utilized a data
base known to be, or almost certainly, incomplete in a critical area of
interest: the associations of index terms in a given small subset of the
thesaurus. This known lack of coverage casts doubt upon the validity of all
association measures calculated from the term pairs actually present.

2.  Synonymous  Index  Terms. 

It would appear that the principal cause of synonymous indexing terms
is failure to recognize that a new term already is included in the definition
of another. This in turn may be more common when the thesaurus does not
define the precise meaning or scope of each term, but leaves the definition
to variable human interpretation. Although it is possible that two synonymous
terms can be matched because of significant associations with a common
third term or set of terms, it is believed that the feasibility of the
method has not been established. The DDC sample contains several hundred
thousand matchings of two terms with a third, few of which are synonyms, and
there is no obvious method by which they can be segregated. It is considered
that the potential use of association measures as a means of identifying
synonyms requires more justification than it has had so far.

3. "General" Indexing Terms.

Every thesaurus contains a number of indexing terms comparable to those
in DDC Group 292, "General Concepts": Analysis, Design, Errors, Measurement,
Reliability, Standards, Tests, Theory, etc. In addition, there exist a
number of other terms of very general meaning and wide applicability, of
which examples are Mechanical Properties, Physical Properties, Production,
Bibliography, and many indexing entries in the field of mathematics.
Finally, terms in the first two or three levels at the "top of the tree" in
a major division or field of interest usually are fairly broad in meaning.

All of these are widely used in indexing documents. In the DDC sample,
90% of the documents include at least one term used 40 times or more and over
half include terms with total usages of over 1,000. (There are only 15 of
the latter.) These percentages would be even greater if the indexing
uniformly included higher-level terms in the thesaurus tree. Their very
popularity of usage generates a large number of pair associations of which
they are one member, and a high percentage of the pairs occur often enough
(which may be three times or less) to have "statistical significance." It
seems doubtful that many of them have any practical utility in a document
retrieval system.

The "profile" of almost every index term used more than 3-4 times
contains several of these general terms. The chances then are quite good that
most or all of the terms in a search request have a significant association
factor with some of them, which may be used to expand the list of terms upon
which the search is made. The final list of document numbers may include
many which are completely extraneous. It is not immediately apparent that
an article on "Penicillin" is germane to a request on "Copper Pipe" merely
because both have a high degree of association to each of the terms "Test
Equipment," "Quality Control," "Standards" and "Production." Conversely,
an article on "Lead Pipe" or "Steel Pipe" well could be relevant.

It appears, then, that these common terms either should be eliminated
as generators of additional terms or their use should be carefully
circumscribed. As an example, the terms added could be limited to those
contained in the same divisional thesaurus tree, or a part of it, that has
one of the narrower-meaning terms of the request. This procedure requires
identifying and earmarking all the common terms to be restricted in usage,
as well as indicating for all other terms the thesaurus tree or subtree to
which they belong. The precise method of making these identifications needs
to be established.
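
One way the restriction might operate is sketched below in Python. It is an
illustration only, since the report leaves the precise method open: the set
of earmarked general terms, the subtree labels ("Materials", "Medicine") and
the association lists are all hypothetical, and the function name
`expand_request` is introduced here for the example.

```python
from typing import Dict, Iterable, Set


def expand_request(request: Iterable[str],
                   associations: Dict[str, Set[str]],
                   general_terms: Set[str],
                   subtree_of: Dict[str, str]) -> Set[str]:
    """Expand a request, restricting what the earmarked general terms may add.

    A term generated by a general term is kept only if it lies in the same
    subtree as one of the narrower-meaning terms of the request.
    """
    request = list(request)
    allowed = {subtree_of[t] for t in request
               if t not in general_terms and t in subtree_of}
    expanded: Set[str] = set(request)
    for term in request:
        for candidate in associations.get(term, set()):
            if term in general_terms and subtree_of.get(candidate) not in allowed:
                continue  # extraneous term generated by a general term
            expanded.add(candidate)
    return expanded


# Hypothetical data for the "Copper Pipe" request discussed in the text.
GENERAL_TERMS = {"Standards", "Production"}
SUBTREE_OF = {
    "Copper Pipe": "Materials",
    "Lead Pipe": "Materials",
    "Steel Pipe": "Materials",
    "Penicillin": "Medicine",
}
ASSOCIATIONS = {
    "Copper Pipe": {"Lead Pipe"},
    "Standards": {"Penicillin", "Steel Pipe"},
}

expanded = expand_request(["Copper Pipe", "Standards"],
                          ASSOCIATIONS, GENERAL_TERMS, SUBTREE_OF)
print(sorted(expanded))
```

Under these assumptions "Steel Pipe" is admitted through "Standards" because
it shares a subtree with "Copper Pipe", while "Penicillin" is rejected.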

E.  TIME-INTERVAL  SUBDIVISION  OF  ASSOCIATION  FACTORS 


The principal operative use of measures of association is to expand an
original search request by adding to it other terms which have a significant
number of co-occurrences with terms in the request, or its first-order
expansion. The presumption is that these terms will isolate otherwise
unobtainable documents which may be relevant to the request. Insofar as
retrieval is concerned, this is considered to be the most important potential
use of association factors.

A document file is a dynamic organism and, by direct extension, so is
the set of indexing terms and their associations. New terms are added to the
thesaurus as new meanings or definitions are introduced into the fields of
interest covered; existing terms may be combined or subdivided into several
new ones to reflect the changing nature of documents. New associations of
terms are generated as previously separated areas of endeavor become wedded.
These changes are inherent in the basic data upon which the retrieval system
operates. In addition to these, procedure-dependent changes in these
parameters are introduced by the normal effort to improve the system's
effectiveness and responsiveness. These effects probably are most significant
during the early years of operation, when revisions and modifications to the
thesaurus, depth and type of indexing, and similar factors may be quite
extensive.

This question arises naturally: Should the time parameter be introduced
as a variable in analyses having to do with index term usage? There is
considerable indirect evidence that this is highly desirable if not necessary.
Although it is generally considered that reports and journal articles lose
a good deal of their value after five years, it appears that most information
centers will retain them in an active status for a longer period of time,
possibly ten years. If the time parameter is not introduced, the values A,
B, N, and f then simply are totals for some fairly long period and often will
not reflect short-term changes. There may be nothing particularly significant
for f = 10 if A = 200 and B = 300. The relationship could be quite
significant if the co-occurrences took place within a 10% time range of A
and B. It is precisely this sort of relationship that would be isolated by
the time parameter.
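
The effect can be illustrated with a short calculation. The sketch below
rests on assumptions not restated in this section: A and B are read as the
usage counts of two terms, N as the number of documents, the 10 of the
example as an observed number of co-occurrences, and the number expected by
chance is taken as A x B / N, the usual value for independently assigned
terms. The interval figures (3,840 documents, 30 and 40 usages) are
hypothetical.

```python
def expected_cooccurrences(a: int, b: int, n: int) -> float:
    """Co-occurrences expected by chance if the two terms were independent."""
    return a * b / n


def observed_to_expected(observed: int, a: int, b: int, n: int) -> float:
    """Ratio of observed co-occurrences to the chance expectation."""
    return observed / expected_cooccurrences(a, b, n)


# Whole file: the example's A = 200 and B = 300 in the 38,402-document sample.
whole_file = observed_to_expected(10, a=200, b=300, n=38_402)

# Hypothetical interval holding about one tenth of the file, in which the
# two terms are used 30 and 40 times and all 10 co-occurrences fall.
one_interval = observed_to_expected(10, a=30, b=40, n=3_840)

print(round(whole_file, 2), round(one_interval, 2))
```

Under these assumptions the same ten co-occurrences stand several times
further above chance within the interval than over the whole file, which is
the kind of relationship the time parameter would isolate.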

Subdividing the file of index term usage into time intervals reduces the
values A, B, and N, and the theoretical frequency f. Because the latter
already is very small for most pairs of index terms, its further diminution
places additional pressure on developing meaningful measures of association.
File storage also increases, because now it is necessary to accumulate A, B,
and f within each time interval. It is concluded that a complete evaluation
of the use of index term associations requires analysis of the effects of the
time parameter. So far as known, this has not yet been considered.

F. SIZE OF DOCUMENT SAMPLES FOR ASSOCIATION FACTOR STUDIES

Several of the published results on investigations into the derivation
and use of association factors have been based upon fairly small samples of
documents, usually fewer than about 500 and limited to one major subject
classification of the library used. There is a good deal of doubt as to the
general validity of these small-sample studies, particularly when results
are to be extrapolated to an entire library. At least three different factors
contribute to this uncertainty.

The first is that the complete file of document descriptions generates
a multitude of small-magnitude statistics. Estimates, based upon sample
data, of anything more than general characteristics are subject to quite
large standard errors. Experience with two different random 10% samples
(each of about 3,300 descriptions) from the 38,402-document DDC file probably
is representative of these uncertainties. Estimating the number of different
index terms in the full file from a sample is subject to an error of about
[?]0%. Attempts to estimate the frequency distribution of their total usage,
based upon the usages of terms included in the sample, have been largely
unsuccessful, except for the 15% most commonly used. Because most term
associations in the full file occur fewer than ten times, the samples have
been of little value in studying them. Statistics based upon only a few
hundred documents seldom will be representative of the full file.

[...] only the documents described during the period the indexing standards
of the sample were followed. They almost certainly are not typical of earlier
documents.


Finally, and most important, samples limited to documents in one subject
classification do not reflect the interactions of term associations
introduced by documents in other classifications. Again referring to the
DDC data, 90% of the different pairs and 85% of their occurrences involve
terms in two different fields of interest. The typical document uses terms
from several groups and fields and the existence of a given interfield pair
usually gives no useful clue as to the subject classification of the
document. It is considered virtually certain that association factor studies
based upon single-subject document data have a definite bias in favor of
the usefulness of the results. By the nature of the sample, all terms added
in the first and second-order cycles must lead to documents pertaining to
the one subject area. One would expect these to have a much higher average
chance of being relevant to a request than documents classified under other
subjects. Actual operating conditions are quite different. Here the values
of factors used in the term association formula employed are determined by
total library usage, as is the calculated measure of association, and the
list of retrieved documents, with or without relevance numbers, is not
confined to those pertaining to a single subject. Any proposed use of
association factors must be adaptable to the entire library. The evaluation
of their usefulness in retrieving documents likewise must be based upon the
total operating environment, and not upon a nonrepresentative subset of it.

It is considered that representative studies into index term associations
and their use must be based upon fairly large samples selected as a
roughly random cross-section of a complete document library. The actual
minimum number of documents required is rather difficult to stipulate and
may vary somewhat depending upon the number of different index terms and
average number of different index terms which have been assigned per
document. A suggested minimum is in the 5-10,000 document range, with the
entire file used if it is less than about 20,000. For larger files, the
sample may range from around 50% of the documents down to possibly 20% for
files of over 100,000. Admittedly, samples of this size involve quite large
volumes of data which are rather expensive to process and this cost may
create a severe strain on limited-budget research studies. On the other hand,
unless the sample is large enough to generate a fairly good array of term
associations, test results may have limited applicability, and perhaps none,
to an operative system.


G.  CONCLUSIONS 


Although it is considered that index term associations may improve the
operation of a document retrieval system, it is concluded that further
research is necessary to establish the degree of improvement which may be
expected. In addition, such studies should take into account the file
storage and data processing aspects of their use.

It is considered desirable to distinguish between associations implicit
in the thesaurus structure and term definitions on the one hand and those
based simply upon co-occurrence in usage on the other. Experimental studies
must be based upon large samples representing a full cross-section of a
library's coverage and the document descriptions must form a complete data
base within the structure of the thesaurus, correcting the deficiencies
which have existed in an unknown degree in almost all studies so far
conducted. Investigation into meaningful measures of statistical significance
of associations should be pursued and the usefulness of co-occurrences
present only a few times established.


Table A-1A

599 Most Common DDC Descriptors With Field and Group Classifications
(In Sequence by Frequency of Usage)

Rank  Diff. Pairs  Fld/Grp      Descriptor                        Rank  Diff. Pairs  Fld/Grp      Descriptor

1     571          13 292       Design                            76    236          15 187       Spectrographic Analysis
2     579          13 292       Tests                             77    121          12 201       Pathology
3     509          15 147       Mathematical Analysis             78    283          15 117       Gases
4     542          13 292       Measurement                       79    305          15 117       Thermodynamics
5     444          01 114       Guided Missiles                   80    310          07 103       Canada
6     491          15 117       Temperature                       81    192          02 227       Search Radar
7     335          06 027       Airborne                          82    222          05 061       Maintenance
8     418          09 217       Production                        83    234          15 18?       Electromagnetic Waves
9     492          13 292       Theory                            84    213          10 016       Steel
10    443          10 ??5       Materials                         85    256          01 006       Shock Waves
11    505          13 292       Analysis                          86    238          06 025       Antennas
12    362          06 027       Surface-to-Surface                87    302          04 053       Reduction
13    46?          07 108       Great Britain                     88    161          14 048       Chemical Warfare Agents
14    437          01 006       Stability                         89    194          06 027       X Band
15    450          13 292       Effectiveness                     90    190          11 256       Sheets
16    313          02 183       Flight Testing                    91    241          07 054       Atmosphere
17    279          02 227       Radar Equipment                   92    260          10 212       Plastics
18    445          17 203       Instrumentation                   93    281          15 187       Absorption
19    464          13 292       Test Methods                      94    242          15 187       Reflection
20    393          02 073       Countermeasures                   95    194          02 227       Radar Tracking
21    383          11 216       Pressure                          96    137          01 005       Wind Tunnel Models
22    368          02 102       Detection                         97    [illegible]  10 056       Coatings
23    271          15 146       Mechanical Properties             98    135          07 054       Meteorological Data
24    436          13 292       Test Equipment                    99    168          01 006       Drag
25    338          01 010       Control Systems                   100   150          04 049       Silicon
26    326          09 217       Processing                        101   2??          13 060       Data Processing Systems
27    274          04 053       Synthesis                         102   191          10 099       Liquid Rocket Propellants
28    362          15 105       Physical Properties               103   201          06 027       Air-to-Air
29    196          12 209       Physiology                        104   133          08 223       Acceptability
30    320          15 117       Heat Transfer                     105   246          11 275       Handling
31    275          1? 076       Circuits                          106   176          13 060       Coding
32    410          13 292       Determination                     107   171          01 094       Fighters
33    251          04 053       Chemical Reactions                108   241          13 292       Configuration
34    232          06 027       Surface-to-Air                    109   194          01 114       Guided Missile Trajectories
35    232          15 247       Stresses                          110   185          14 100       Guided Missile Fuzes
36    283          04 106       Polymers                          111   173          15 066       Crystal Structure
37    245          14 020       Projectiles                       112   151          09 217       Manufacturing Methods
38    235          01 006       Aerodynamics                      113   270          17 136       Test Facilities
39    250          16 057       Combustion                        114   174          06 22?       Radio Equipment
40    238          01 009       Jet Planes                        115   234          06 031       Microwaves
41    313          15 116       Radiation Effects                 116   271          13 292       Control
42    298          16 085       Rocket Motors                     117   194          16 219       Rocket Propulsion
43    269          02 170       Guidance                          118   174          15 029       Molecular Structure
44    307          13 292       Reliability                       119   176          12 209       Growth
45    227          10 099       Solid Rocket Propellants          120   174          06 059       Radio Communication Systems
46    302          15 187       Propagation                       121   173          02 067       Anti-Aircraft Defense Systems
47    293          15 178       Scattering                        122   223          15 247       Structures
48    237          01 005       Model Tests                       123   248          14 020       Explosives
49    309          15 147       Statistical Analysis              124   10?          ?3 ?36       Biochemistry
50    331          01 006       Vibration                         125   201          13 060       Programming
51    252          15 066       Crystals                          126   203          07 054       Climatic Factors
52    214          06 273       Transistors                       127   262          04 049       [illegible]
53    361          13 071       Bibliography                      128   209          07 054       Meteorology
54    255          02 062       Storage                           129   152          04 131       Hydrides
55    169          01 006       Supersonics                       130   218          15 105       Energy
56    321          06 079       Electronic Equipment              131   196          06 059       Communication Systems
57    193          06 082       Electron Tubes                    132   209          01 096       Gas Flow
58    356          01 009       Aircraft                          133   180          15 147       Probability
59    239          15 247       Semiconductors                    134   207          09 217       [illegible]
60    175          14 100       Fuzes                             135   126          12 269       Toxicity
61    284          02 196       Military Requirements             136   215          15 230       [illegible]
62    323          15 148       Velocity                          137   243          03 001       Hazards
63    283          13 060       Digital Computers                 138   177          06 027       Shipborne
64    247          04 131       Oxides                            139   176          01 006       Boundary Layer
65    297          19 253       Satellite Vehicles                140   177          07 105       Arctic Regions
66    215          01 094       Jet Fighters                      141   238          15 148       Motion
67    271          15 076       Electrical Properties             142   230          15 029       [illegible]
68    275          13 292       Sensitivity                       143   [illegible]  01 005       Cylindrical Bodies
69    224          01 006       Launching                         144   224          01 006       [illegible]
70    270          13 292       Errors                            145   243          14 237       [illegible]
71    251          14 090       Vulnerability                     146   19?          16 323       [illegible]
72    230          13 060       Computers                         147   244          01 ???       [illegible]
73    301          13 104       Operation                         148   [illegible]  15 07?       [illegible]
74    279          10 160       Metals                            149   [illegible]  15 ??6       [illegible]
75    200          15 247       Deformation                       150   139          ?? 131       [illegible]


Table A-1B

599 Most Common DDC Descriptors With Field and Group Classifications
(In Sequence by Frequency of Usage)

Rank  Diff. Pairs  Fld/Grp      Descriptor                        Rank  Diff. Pairs  Fld/Grp      Descriptor

151 

091 

07  054 

geatns:'  Fcreiastlng 

<26 

238 

15  X*5 

Dansity 

152 

117 

n  ns 

Sane 

227 

199 

06  027 

AutrsaUc 

15> 

iC-5 

04  131 

FxtSOrtdea 

228 

*94 

07  103 

USS» 

154 

1SS 

<34  xn 

Cbiori4** 

229 

174 

15  211 

?l**as  pfiysit.* 

155 

193 

14  09* 

Penetration 

23C 

144 

15  237 

ini* aits  PidSaV.no  ?»u«rt>J 

156 

202 

10  146 

Corsnie 

231 

144 

06  02" 

7«7  Riga  Jvoquoncy 

157 

143 

Cf>  oz? 

S  Band 

232 

05S 

OS  223 

p*refcol«ar 

155 

203 

04  049 

(Vdrogen 

233 

225 

15  247 

jurfaca* 

159 

1*0 

03  119 

!b&ssi 

234 

1*6 

04  053 

Oxidation 

360 

159 

10  016 

AltiSlUUS  A lleyd 

235 

1*9 

13  706 

Photographic  Anal) ala 

161 

195 

13  249 

Acoustics 

236 

186 

04  267 

ion* 

162 

176 

15  £4 7 

EUa  tioliy 

23? 

195 

15  247 

Solid* 

lb? 

129 

12  209 

Inhibition 

238 

160 

06  081 

Signal-To-Kaiai  Hatlo 

164 

5.43 

13  040 

Data  7ran**ieeiun  Sfcr-itepta 

239 

159 

11  240 

Safaty  0*vlc»s 

167 

193 

02  *62 

Container# 

240 

196 

14  032 

tarsinal  Balliatic* 

166 

193 

15  230 

Oases*  nsjr> 

241 

164 

01  012 

/L'.rfraiaa* 

m 

170 

15  121 

Fluid  flow 

242 

165 

02  062 

Packaging 

loS 

172 

09  7X7 

5arro*ia<t 

243 

095 

15  H7 

foocVon* 

ite 

1.64 

«.  *22 

Asplifiara 

244 

053 

03  086 

Quyaas 

ins 

169 

06  02? 

At  r-t>- Surface 

245 

1S6 

15  133 

Ssuroa* 

i'll 

4lj 

13  104 

^ecifteatviSa 

iii 

ICO 

01  006 

Tranaanles 

171 

1*3 

01  041 

Sc-oSssrs 

247 

150 

0&  on 

High  Praqaancy 

173 

136 

«i  059 

Display  3vst»n# 

2*3 

184 

IS  076 

tlactroaagnailc  STfast* 

174 

200 

15  147 

tsbiai 

<43 

191 

15  117 

Tharsal  Radiation 

175 

159 

15  ua 

Tonsil*  Properties 

250 

119 

15  148 

^paeiric  Ispulat 

176 

13? 

06  022 

Mjcrovsro  Aspliris?n 

251 

11? 

02  073 

Ka oar  Jsaaing 

177 

115 

CO  09* 

PbC3 

252 

173 

15  28? 

gave  Transniaaion 

178 

ITS 

02  *06 

Thrust 

253 

1G6 

14  10* 

Radio  Proalaity  TVlaj 

177 

174 

04  0>3 

Dadoapps5.  tlon 

254 

122 

10  016 

Titaniun  illoya 

18> 

134 

•21  005 

ieKdyaehit  Configuration* 

253 

U? 

04  106 

Organic  Compounds 

isi 

cei 

1?  15+ 

256 

108 

06  032 

Traveling  Vfava  Tab* a 

182 

107 

02  105 

Tracking 

257 

133 

01  041 

Jet  Bwtlara 

133 

214 

15  137 

l.hterfsrcnrs 

255 

X?6 

IS  CF76 

Dlalac trios 

'34 

156 

Cl  *09 

S*liscpi«r* 

25? 

158 

13  292 

Standards 

1£5 

109 

01  006 

Ufl 

260 

085 

12  266 

Therapy 

186 

187 

*4  051 

Casnicsl  inniysla 

261 

163 

17  234 

SciantifiC  Raasarch 

1*7 

157 

06  *7? 

Hi  crouses  Sq-cip®srl 

262 

177 

13  292 

Calibration 

183 

161 

si 

load  Distribution 

263 

119 

10  158 

Fatigv*  (Hschanies) 

189 

?>? 

54  15? 

Aisx'.nua 

264 

188 

06  074 

Eleotronlc  Circuit* 

29* 

151 

C?  05i 

Wind 

265 

154 

09  217 

Oaterioratlon 

191 

157 

15  ?30 

Radioactive  Jsctspas 

266 

140 

10  158 

Shoes  Raaiatsoca 

192 

23* 

15  029 

lenizslloc 

267 

170 

06  031 

Radar  Xafleotioua 

193 

:<• 

04  049 

Bates  Coapounda 

268 

114 

10  146 

hinders 

194 

I6i 

02  163 

Aerial  f.sconr.si seuw c 

269 

136 

10  00* 

Seals 

195 

t*S 

15  029 

Electron* 

270 

129 

06  229 

Radio  Receivers 

196 

191 

15  2>J 

Radioactivity 

271 

144 

09  21? 

*gi<* 

197 

s  a 

In  US 

Araaasst 

272 

161 

10  079 

Rocket  Propallant* 

19S 

16fc 

Of 

Electrical  &juii>s»nt 

273 

169 

02  127 

Infrared  Detectors 

59V 

216 

14  090 

Oetoaatloo 

274 

153 

01  009 

Airplanes 

2*9 

135 

«■  232 

Recording  Diviner 

275 

121 

11  27  > 

Transportation 

501 

165 

14  G9C 

HUst 

276 

176 

15  076 

poxarlastlcn 

?0? 

14? 

04  131 

ferrites 

277 

189 

15  187 

light 

233 

112 

Of.  C31 

Radio  !»»r*a 

273 

123 

10  099 

Socket  Oxidlsers 

S04 

191 

Ox.  2i< 

Phosr  Ssippli** 

2?9 

<6S 

12  209 

Lif*  Svpactsney 

205 

175 

06  Oi¬ 

SUra  High  Protuaacy 

?i» 

168 

34  ?86 

Soidad  Missile  g&rheads 

206 

375 

ls  244 

aiSKsrlcos 

2?1 

173 

'.5  1*7 

Optics 

XT! 

172 

%j  14? 

Dirfjexifittal  Erections 

2S2 

154 

01  OQS 

Aircraft  Squlpwnt 

703 

174 

15  105 

Intensity 

2S3 

136 

16  08' 

Turbo*et  Engines 

239 

147 

06  032 

Diodes 

234 

173 

16  219 

Propulsion 

210 

151 

06  027 

Broedb«;<d 

265 

141 

01  006 

Turbxlenie 

211 

<25 

15  105 

Curfsoa  Proce-U*s 

286 

141 

15  14? 

Saspling 

212 

<46 

05  006 

Plight  fstha 

287 

198 

13  060 

Antiog  Cosputera 

213 

113 

01  003 

•'iaga 

238 

141 

04  018 

Aainaa 

214 

178 

14  *20 

2*9 

123 

15  187 

Visibility 

215 

206 

15  11? 

Csoiicf 

290 

117 

0?  or,  J 

R*d4r  Interception 

at 

212 

13  292 

Data 

291 

152 

15  187 

Infrared  Spectroscopy 

21? 

155 

14  261 

T»n«« 

252 

150 

1$  076 

Bectroaagneilc  Properties 

213 

152 

01  00b 

Supersonic  rice 

293 

104 

13  246 

Selection 

219 

110 

13  073 

Scheduling 

294 

205 

11  240 

Safety 

22:  s 

133 

1 4  050 

Higir-Explasivs  AssuciUon 

m 

032 

14  020 

Cartridge* 

?a 

3  57 

14  <59* 

Fir*  Sontrol  iysteej 

296 

140 

20  212 

LaM  nates 

222 

203 

07  *54 

aatar 

297 

155 

15  a: 

Sax  Ionisation 

223 

059 

03  098 

Preservation 

258 

14A 

15  U6 

Cor.  taxi  nation 

224 

137 

09  ?<V 

3fr«4  Treaiasst 

299 

123 

01  005 

Sodio*  of  Revolution 

225 

146 

I*  092 

Olas*  Ibxlites 

300 

21? 

15  237 

Attenuation 

Table  A- 1C 
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(In  Sequence  by  Frequency 


and  Group  Clsssif 3  cationo 
of  Usage) 


.*3  an. 

Dirr. 

J'airs 

Fld/G.-p 

Oejerlpv/ir 

Beak 

Onf. 

L6iT5 

Fid/ Grp 

descriptor 

JO! 

147 

15  366 

Single  Crystals 

376 

116 

08  202 

avia lien  Personnel 

0?9 

i4  100 

Projectile  Fuzes 

377 

114 

09  *17 

Quality  Centro! 

30? 

155 

07  0f4 

Upper  Atmosphere 

373 

133 

15  187 

nicllialion 

304 

100 

13  071 

Clcssifleatlor 

379 

123 

01  006 

Hypersonic  Flow 

305 

092 

03  243 

frinevlor 

330 

129 

06  025 

Radar  Antennas 

306 

163 

13  07? 

Instruction  Manuals 

331 

101 

10  099 

Propellant  Properties 

30".’ 

182 

19  261 

Re-entry  Aerodynamics 

382 

129 

10  159 

Metallurgy 

3Cj 

156 

01  GOfc 

Hypertonics 

353 

114 

15  etc 

Data  Storage  Syatecs 

309 

141 

16  173 

Rocxe*  Moto**  Pozzies 

384 

139 

U  04C 

Aerosols 

310 

125 

04  157 

Germanium 

;ej 

125 

10  182 

Luorl carts 

311 

104 

15  147 

fietrlx  Algebra 

336 

134 

12  150 

Identification 

312 

184 

16  039 

Exhaust  Gases 

387 

132 

Cl  227 

Doppler  Rada-; 

313 

079 

H  043 

0  Agents 

383 

136 

09  217 

Bonding 

3  U 

1  5? 

06  059 

Cc'ffzuniteUon  Equipment 

339 

187 

0?  05u 

Air 

315 

134 

15  066 

Hlcrostnmture 

390 

144 

15  12) 

An sorption 

31C 

166 

04  003 

Etnyleaes 

391 

216 

13  071 

Symposia 

317 

095 

15  247 

Creep 

392 

123 

10  C16 

Stainless  Steel 

312 

09? 

04  053 

folyaerlzaiios 

393 

131 

01  0C-6 

Jets 

319 

152 

19  251 

Atsosphere  Sjtry 

394 

138 

14  t  ;r. 

Cxwrior  frtli  i*'.(es 

320 

3  30 

15  029 

X  days 

395 

113 

09  21“ 

Casting 

323 

191 

19  253 

SpocssMps 

396 

138 

15  247 

Thin  Jilts 

322 

158 

15  249 

Seise 

397 

165 

Cl  00c 

itabilizrlion 

323 

091 

33  270 

Military  Training 

398 

124 

02  227 

Cedar  Deceive; s 

324 

10-1 

10  153 

Fracture  (Mechanics) 

<99 

140 

13  073 

Costs 

325 

112 

14  020 

Fio-Stabi  1  i  z-ti  Ammunition 

400 

072 

15  066 

•-.artz  Crystals 

326 

127 

15  215 

Snocx  Tubes 

401 

087 

15  147 

Partial  Dlffarentlel  Equations 

327 

13? 

10  099 

Jet  Erglno  Fuels 

402 

106 

15  076 

Magnetic  Proper* les 

3*8 

150 

01  114 

Guldtd  Missile  Noses 

•ai 

134 

G6  079 

Electronic  Systems 

329 

105 

04  193 

Hydraxlnos 

404 

C*37 

01  006 

Laminar  Boundary  Layer 

330 

182 

04  04? 

Nitrogen 

405 

152 

34  0?2 

Ballistics 

331 

106 

02  102 

Detectors 

406 

114 

01  255 

Transport  Planes 

332 

183 

16  247 

Conductivity 

407 

116 

10  145 

Refractory  Materials 

333 

111 

H  100 

Arming  Harlots 

/OS 

091 

01  005 

TVlenguIar  oing* 

334 

111 

04  193 

‘Jrettiunee 

4)39 

157 

15  249 

Transducers 

335 

life 

19  251 

Setalllto  VeMclo  Trajectories 

410 

094 

01  006 

Stability  (Longitudinal) 

336 

150 

15  247 

Grad!  at  ion  Damage 

411 

036 

06  082 

Magnetrons 

in 

158 

04  106 

Liquids 

412 

14  5 

15  143 

Tcnact  Shock 

238 

144 

19  099 

Fuels 

413 

168 

01  006 

High  Altitude 

33? 

102 

07  045 

Mapping 

414 

120 

06  075 

Electrodes 

340 

037 

12  023 

Tlssuss  (Biology) 

415 

056 

16  057 

Combustion  Chambers 

541 

143 

96  027 

Radio frequency 

416 

143 

02  180 

Atomic  Bomb  Explosions 

342 

J8C 

17  234 

High  Teapereture  Rseoerch 

417 

152 

11  282 

Vehicles 

543 

156 

10  153 

Failure  (Mechanics) 

418 

131 

02  227 

Radar  Targets 

344 

110 

14  020 

Antitank  Ammunition 

419 

179 

15  117 

Heating 

345 

133 

15  066 

X-Ray  Diffraction  Analysis 

420 

135 

15  122 

PTosion 

346 

138 

15  121 

Viscosity 

421 

138 

15  076 

Dielectric  Properties 

347 

124 

04  049 

Nitrogen  Compounds 

422 

118 

15  ICO 

Electron  Beams 

348 

141 

04  207 

Purification 

423 

092 

15  116 

Dose  Rate 

349 

132 

15  066 

Lattices 

424 

148 

04  011 

Separation 

350 

094 

10  056 

Corrosion  Inhibition 

425 

175 

02  227 

Radar 

351 

053 

03  223 

Attitudes 

426 

136 

02  170 

Navigation 

352 

193 

15  187 

Infrared  Radiation 

427 

009 

09  288 

Welding 

353 

120 

14  236 

Warheads 

428 

140 

04  131 

Sulfides 

354 

125 

14  100 

Proximity  Fuzes 

429 

138 

04  049 

Sodium  Compounds 

355 

117 

04  131 

Perchlorates 

430 

08? 

15  225 

Quantum  Mechanics 

356 

114 

07  054 

Moisture 

431 

04  131 

Nitrates 

357 

150 

06  216 

Generators 

432 

104 

10  145 

Ferromagnetic  Materials 

358 

133 

15  107 

Frequency 

433 

119 

08  202 

Military  Personnel 

359 

151 

01  006 

Wind  Tunnels 

434 

105 

14  165 

Mines 

360 

104 

06  02? 

L  Band 

435 

069 

08  223 

Learning 

361 

092 

13  073 

Industrial  Production 

436 

052 

12  266 

Diet 

362 

14? 

15  1.53 

Impurities 

437 

136 

04  049 

Aluminum  Compounds 

363 

143 

10  112 

Glass 

438 

092 

15  249 

Underwater  Sound 

364 

133 

06  166 

Frequency  Modulation 

439 

105 

15  211 

Magnetohydrodynamics 

365 

122 

04  190 

Methyl  Radicals 

440 

060 

04  207 

Dehydration 

366 

134 

15  065 

Liquefied  Gases 

441 

215 

01  006 

Simulation 

367 

118 

07  054 

Ionosphere 

442 

081 

14  048 

1  Agents 

368 

121 

01  005 

Control  Surfaces 

443 

076 

01  005 

Wing-Body  Configurations 

369 

113 

06  274 

Wave  Guides 

444 

114 

15  117 

Phase  Studies 

370 

113 

17  080 

Test  Sets 

445 

134 

06  027 

Low  Frequency 

371 

172 

14  032 

Range 

446 

127 

10  099 

Rocket  Fuels 

372 

130 

10  099 

Propellant  Grains 

447 

113 

06  081 

Radio  Interference 

373 

115 

15  147 

Numerical  Analysis 

448 

105 

15  029 

Molecules 

374 

137 

19  253 

Hypervelocity  Vehicles 

449 

096 

09  217 

Machining 

375 

119 

02  102 

Direction  Finding 

450 

138 

16  05? 

Flames 

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Rank 

Diff.  Pairs 

Fld/Grp 

Descriptor 

Rank 

Diff.  Pairs 

Fld/Grp 

Descriptor 

451 

10  146 

rile* 

526 

072 

;i  202 

Cargo  Vehicles 

452 

105 

15  076 

Electrical  Networks 

527 

108 

04  131 

Carbides 

453 

C98 

14  020 

Armor  Piercing  Ammunition 

528 

121 

01  114 

Aerial  Targets 

454 

138 

15  200 

Targets 

529 

052 

15  147 

Statistical  Processes 

455 

097 

06  074 

Switching  Circuits 

530 

lOo 

10  239 

Rubber 

456 

133 

02  102 

Range  Finding 

531 

146 

01  114 

Recovery 

457 

1« 

02  183 

leading 

532 

138 

06  197 

Oscillators 

458 

132 

15  148 

Friction 

533 

203 

15  0 fit 

Impedance 

459 

132 

01  006 

Damping 

534 

14*' 

15  076 

Electric  Fields 

460 

114 

01  005 

Conical  Bodies 

535 

150 

16  148 

Dynamics 

461 

129 

13  246 

Conferences 

536 

M6 

04  190 

Vinyl  Radicals 

462 

114 

15  m 

Thermal  Stresses 

537 

062 

08  223 

Reasoning 

463 

087 

10  239 

Synthetic  Rubber 

538 

:c: 

15  ll>6 

Porosity 

464 

123 

07  054 

Ice 

539 

C19 

04  157 

Molybdenum 

465 

OKI 

01  006 

Flutter 

540 

095 

02  226 

Military  Equipment 

466 

103 

08  270 

Training 

541 

097 

16  147 

Information  Theory 

467 

039 

13  104 

Maneuverability 

542 

099 

13  104 

Engineering 

468 

120 

19  253 

Lunar  Probes 

543 

021 

06  032 

Backward-Wave  Oscillators 

469 

117 

04  08? 

Esters 

544 

or/ 

01  095 

Airfoils 

470 

150 

07  028 

Earth 

545 

063 

01  009 

Vertical  Take-off  Planes 

471 

060 

12  209 

Nutrition 

546 

094 

01  006 

Turbulent  Boundary  Layer 

472 

153 

10  145 

Insulating  Materials 

547 

322 

15  147 

Perturbation  Theory 

473 

110 

02  170 

Inertial  Guidance 

548 

)2fc 

Ci  049 

Carbon 

474 

122 

14  042 

Bombs 

549 

081 

15  147 

Series 

475 

100 

01  005 

Blunt  Bodies 

550 

116 

02  227 

Radar  Echo  Areas 

476 

129 

06  02? 

Rrdiofrriuency 

551 

102 

04  053 

Pyrolysis 

477 

106 

01  006 

Moments 

552 

XOO 

10  212 

Epoxy  Resins 

478 

119 

15  249 

Sound 

553 

131 

04  061 

Chemistry 

479 

>30 

12  023 

Skin 

554 

094 

14  026 

Armor  Plate 

480 

\o 

18  244 

Ships 

555 

143 

04  049 

Silicon  Compounds 

481 

04° 

08  223 

Group  Dynamics 

556 

118 

06  031 

Radar  Signals 

482 

079 

07  103 

Geography 

557 

077 

19  251 

Orbital  Flight  Paths 

483 

114 

35  IS? 

Diffraction 
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Table  A-3 

Pair  Associations  Among  the  599  Most  Common  DDC  Descriptors, 
Classified  by  Number  of  Different  Pairs  and  Total  Occurrences 


Total 

Pair 

Occurrences 


100-199 
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Table  A-5 

Pair  Occurrences  of  the  50  Most  Frequently  Used  DDC  Descriptors 
(Selected  Summary  Data) 


Columns: 

Descriptor  (Frequency  of  Usage  Sequence) 

No.  of  Doc. 

Pairs:  Diff. 

Pairs:  Total 

%  of  Descr.  Used  With 

599  Most  Common  Descr.:  No.  Used  With,  Total  Pairs,  %  Used  With,  Average  Pair  Occur. 

Remaining  404)  Descriptors:  No.  Used  With,  Total  Pairs,  %  Used  With,  Average  Pair  Occur. 

Design 

el93 

2660 

26364 

46.2? 

571 

18117 

21.4? 

31.7 

2089 

8247 

42.5 

3.0 

Test 

5237 

2642 

22280 

47.7 

570 

14070 

21.0 

25.o 

2063 

7310 

41.8 

3.5 

Mathematical  Analysis 

2470 

1472 

11424 

30.2 

500 

8230 

30.4 

16.3 

1163 

3144 

23.5 

2.7 

Measurement 

177(1 

1846 

0205 

33.3 

512 

6034 

29.4 

11,1 

1304 

3171 

26.4 

2.4 

Guided  Missiles 

1701 

1125 

10457 

20.3 

444 

8500 

30.5 

19.4 

681 

1858 

13.8 

2.7 

Temperature 

1489 

1568 

7885 

28.3 

401 

5410 

31.3 

11.0 

1077 

2475 

21.6 

2.3 

Airborne 

1380 

000 

6078 

17.9 

385 

5391 

39.0 

14.0 

605 

1587 

12.2 

2.6 

Production 

1212 

1239 

5100 

22.4 

413 

3366 

33.7 

0.1 

021 

1732 

16.6 

2.1 

Theory 

120° 

1427 

5839 

25.8 

402 

4033 

34.5 

3.2 

035 

If. 06 

18.0 

!  ,o 

Materials 

1155 

131 1 

5700 

23.7 

443 

3801 

34.1 

8.8 

868 

1000 

1?  6 

2. 2 

Analysis 

1113 

1410 

5035 

25.5 

505 

3505 

35.8 

6.9 

005 

1530 

18.3 

1.7 

Surface-to-Surface 

1084 

715 

5557 

12.0 

362 

4766 

50.6 

13.2 

353 

yo  1 

7.1 

2.2 

Great  Britain 

1075 

1240 

4704 

22.4 

466 

3358 

37.6 

7.2 

774 

1346 

15.7 

1.7 

Stability 

104  i 

!  j.76 

5484 

21.2 

437 

3300 

37.2 

9.1 

730 

1404 

15.0 

2.0 

Effectiveness 

1C40 

1305 

4902 

23.6 

450 

3335 

34.5 

7.4 

855 

1567 

17.3 

1.8 

Flight  Testing 

o2d 

708 

4728 

12.8 

313 

3746 

44.2 

12.0 

J05 

082 

8.0 

2.5 

Radar  Equipment 

015 

658 

5582 

ll.o 

270 

4400 

42.4 

15.0 

370 

1173 

7.7 

3.1 

Instrumentation 

008 

1174 

4741 

21.2 

445 

3328 
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Table  A-6 

Summary  Statistics  of  599  Most  Common  DDC  Descriptors, 
Classified  by  Field  of  Interest 
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